Preface


‘Any process that works can be understood; what cannot be understood is suspect.’ 
This is effectively what Marvin Minsky and Seymour Papert stated twenty years ago 
during the first wave of interest in artificial neural networks. This critical remark 
could be made again in the recent second wave of interest in the topic, because for 
many people neural networks still seem something of a black art. Perhaps this attitude 
has been engendered by the tremendous number of recent publications on neural
networks, in which it seems that every author invents his or her own type of
neural network, frequently justified only by some particular application. For
the reader it must be a chaotic collection of seemingly unrelated questions without 
a consistent framework and a unifying perspective. We certainly do not claim that 
we can give such a unified framework at this stage of development of this new scientific 
field, but we can at least open the black boxes of the main types of neural networks 
that can be thoroughly understood and that have turned out to be very useful in a 
broad range of applications. 

A reason for the excitement about neural networks might be that in the literature 
on artificial neural networks one frequently encounters very promising and attractive 
statements about the generalization capability of neural networks, like: ‘Neural 
networks are capable of adapting themselves with the aid of a learning rule and a 
set of examples to model relationships among the data without any a priori 
assumptions about the nature of the relationships.’ A similar statement is: ‘After 
learning neural networks they may be used to predict characteristics of new samples 
or to derive empirical models from examples in situations in which no theoretically 
based model is known.’ Although to a certain extent these types of statements are 
true, one must be careful with the substatements that no model or a priori information 
about the nature of the relationship between examples is assumed. 

Generalization is the process of inductive inference of general relationships from 
a finite number of samples. An example is the inference of a new number in a finite 
sequence of numbers: one is inclined to say that the next example in sequence 
1, 4, 9, 16,... will be the number 25, because we observe a simple regularity in the 
sequence: the kth element in the sequence is $k^2$. However, if no prejudice in favour
of some type of ‘model’ exists, any number may follow the given sequence. For 
example, one might as well say that the next number is 27 because one is in favour of
the regularity where the nth number y(n) in the sequence is defined by y(n) = ‘sum
of the first n uneven primes’ (with 1 counted as a prime). If one asks many people (who know elementary calculus!)
to guess the next number in the sequence given above, almost all will say that the
next number is 25. This phenomenon reveals the human tendency always to select the
‘simplest’ model to explain a sequence of experimental observations.
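
The point can be checked with a minimal Python sketch. The function names are hypothetical, and the second model counts 1 as an ‘uneven prime’, which is an assumption needed to make it reproduce the observed sequence. Both hypotheses fit the four observed numbers yet disagree about the fifth.

```python
def squares_model(n: int) -> int:
    """n-th element under the hypothesis y(n) = n^2."""
    return n * n


def is_odd_prime(m: int) -> bool:
    """True for odd primes 3, 5, 7, 11, ... (m is assumed odd and >= 3)."""
    return all(m % d for d in range(3, int(m ** 0.5) + 1, 2))


def odd_prime_sum_model(n: int) -> int:
    """n-th element under the hypothesis 'sum of the first n uneven primes'.

    Assumption: 1 is counted as the first 'uneven prime'.
    """
    terms = [1]
    candidate = 3
    while len(terms) < n:
        if is_odd_prime(candidate):
            terms.append(candidate)
        candidate += 2
    return sum(terms[:n])


observed = [1, 4, 9, 16]
for model in (squares_model, odd_prime_sum_model):
    fits = [model(n) for n in range(1, 5)] == observed
    print(model.__name__, "fits the data:", fits, "next:", model(5))
```

Nothing in the data alone favours one model over the other; the preference for 25 comes from a prior preference for the simpler rule.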

A neural network is capable of modelling relationships among data by learning
from the examples, but it is always using some a priori set of models. ‘Modelling’ can
be defined as the process of formulating a finite set of interrelated rules (or the
construction of a finite set of interconnected mechanisms) by which one can generate
or explain the (potentially infinite) set of observed data. The simplicity of the model 
is very subjective because it depends on the domain of knowledge of the person (or 
of the mechanical process) that is doing the modelling. 

If one is using complex neural networks to ‘model’ the relationships behind a given 
set of data, it is hard to demonstrate that one is using (or assuming) a priori information 
about the kind of relationship. We will, however, demonstrate in several sections of 
this book that one needs, or assumes, a priori information about the relationship 
between the given data in order to be justified in accepting the outcome of the learning 
process of a neural network. The a priori information one is using is determined by 
the type of neural network, the configuration of the neural network and the type of 
neural transfer functions. In the application of neural networks to the solution of 
real-life problems it is important to be aware of this phenomenon. For example, if 
we want to infer with a two-layer continuous Perceptron from a given data set an 
unknown functional relationship and select the number of first-layer neurons in a 
Perceptron too low, then the unknown function will be underfitted (i.e. will not go 
through all data points); if we select the number of neurons too high, then the unknown 
function will be overfitted (the realized function will go through the data points but 
will fluctuate wildly in between). 

This example, more extensively discussed in Section 3.13, shows that generalization 
by learning from examples and counterexamples is in general impossible without 
utilizing a priori knowledge about the properties of the function to be identified. 

If one uses a neural network to find the relation behind the data in a data set, one 
is in addition assuming that the data set is representative of the (unknown!) relation.
Nevertheless, it must be said that by using neural networks one can solve in an 
optimal way certain problems that are hard to tackle by conventional methods. For 
instance, we will show that among the set of all classifiers that divide the n-dimensional
input space by an (n-1)-dimensional hyperplane, the single-neuron Perceptron is an
optimal classifier. 

We will concentrate in this book on these kinds of limitations and capabilities of
the main type of neural networks, rather than giving a review of the multitude of 
different neural networks. Our aim is that the reader should profoundly understand 
and be able to apply artificial neural networks to the solution of practical problems. 

Almost 70 per cent of all publications deal with the type of networks we will analyze 
in this book: the binary Perceptron, the continuous Perceptron and the self-organizing 
neural network, and when we consider the applications that turn out to be useful, 
this percentage is even higher. 

The book is self-contained, which means that we will deliberately avoid the
fashionable vicious circle of proving propositions by quoting the propositions of
other authors. The methods of analyzing neural networks are largely of a mathematical
nature but the level of mathematical rigour in our exposition will nevertheless be 
low because we are far more concerned with providing insight and understanding 
than establishing a rigorous mathematical foundation. The required mathematical 
maturity is that of a typical final-year undergraduate student in electrical engineering 
or computer science. 

Understanding is the ability to transform new phenomena to a coherent simple 
structure of already well understood phenomena. For this reason we will give many 
illustrative examples and plausible arguments in terms of what is supposed to be well 
known by the reader. A great number of real-life applications will also contribute to 
this understanding and will show at the same time the powerful usefulness of neural 
networks. 

For several years we have taught courses at the University of Twente based on 
the material in this book. The book can be covered in a one-semester course. 

Each chapter concludes with some exercises. The lists of literature are far from 
complete because we only want to give the reader a map for the main routes in the 
bewildering and chaotic landscape of published material. The exercises are meant as 
a means to check one’s own understanding of the presented theory. 

While writing this book we benefited from the comments of many colleagues and 
from the experiments performed by our students. We would like to thank especially 
Philip de Bruin, Mark Bentum, Andre Beltman and Cuun Krugers-Dagneaux-Rikkers 
for their suggestions for improving the manuscript. 


Leo P. J. Veelenturf 
INTRODUCTION


1.1 Machines and brains 


For several years the author has been involved in research on pattern recognition. 
In that period he became aware of the tremendous range of sophisticated methods 
used to analyze and to recognize pictures by machines. The pattern recognition 
machines were equipped with large numbers of vast and complicated algorithms. The 
most advanced machines could, for instance, recognize a certain class of handwritten 
digits, but in spite of the sophisticated nature of these machines they were limited to 
recognizing those pictures that had been foreseen by the system builders as potential 
elements to be recognized in the future. For example, one can build machines to
recognize handwritten capital ‘A’s but the system will fail to recognize a capital ‘A’ 
as given in Figure 1.1. It is surprising that human beings can recognize the letter in 
Figure 1.1 as an ‘A’ as it is very unlikely that one has ever seen the figure before. 

It is very unlikely that human beings compare the handwritten ‘A’ to some reference 
picture stored in their brain. Probably they know the characteristic features of an 
‘A’ and their perception is not disturbed by artefacts in the picture. The interference 
of artefacts in a picture will, however, destroy the correct classification by a 
programmed machine, which is probably not able to judge the importance of 
deviations from the preprogrammed standard features.

The way people have acquired the ability to recognize pictures can only be by 
experience. By trial and error they have learned to perform certain tasks. Machines 
do not learn, they are preprogrammed, and if they can learn they are restricted to 
certain classes of preprogrammed methods of learning. 

The lesson seems to be that the capability of learning is essential for more advanced 
and intelligent artificial machines. 

Another striking difference between machines and human beings is the
computation time required for complicated tasks such as pattern recognition.
Computers are extremely fast but it is hard to design machines that can recognize 
three-dimensional objects in real time, whereas humans, whose brains are composed 
of neurons switching about a million times slower than electronic components, can 
recognize old friends almost instantaneously. We know that computers perform their 
computations sequentially, step by step, whereas the human brain is processing the 
information in parallel. 

Figure 1.1 A handwritten capital ‘A’


The second lesson for designing more intelligent systems seems to be the use of
the parallel processing of information. 

Man-made machines are built with a large number of different complicated 
functional building blocks; if one unit fails, the whole system collapses. The brain, 
however, is built out of a large number of, at least from a functional point of view, 
almost identical building bricks: the neurons. Many units may be destroyed without 
significantly changing the behaviour of the total system. 

This comparison between the behaviour and construction of artificial machines 
and the behaviour and physiological configuration of the human brain might give 
new ideas for developing more intelligent machines. Artificial neural networks are
the results of the first steps in this new direction for intelligent system design. 


1.2 The artificial neural network 


The building unit of a neural network is a simplified model of what is assumed to
be the functional behaviour of an organic neuron. The human brain contains about
$10^{11}$ neurons. For almost all organic neurons one can distinguish anatomically roughly
three different parts: a set of incoming fibers (the dendrites), a cell body (the soma)
and one outgoing fiber (the axon). For a simplified configuration see Figure 1.2. The
axons divide up into different endings, each of which makes contact with other
neurons. A neuron can receive up to 10000 inputs from other neurons. The bulb-like
structures where fibers contact are called synapses. Electrical pulses can be generated
by neurons (so-called neuron firing) and are transmitted along the axon to the
synapses. When the electrical activity is transferred by the synapse to another neuron,
it may contribute to the excitation or inhibition of that neuron. The synapses play 
an important role because their transmission efficiency for electrical pulses from an 
axon to the dendrites (or somas) of other neurons can be changed depending on the 
‘profitability’ of that alteration. 

The learning ability of human beings is probably incorporated in the facility of 
changing the transmission efficiency of those synapses. Donald O. Hebb was among 
the first who postulated this mechanism in his book Organization of Behavior (1949): 

Figure 1.2 Simplified configuration of an organic neuron (labelled parts: dendrites, synapse, axon, soma)


Figure 1.3 Artificial model of a neuron


‘When an axon of cell A is near enough to excite a cell B and repeatedly or persistently 
takes part in firing it, some growth process or metabolic change takes place in one 
or both cells such that A’s efficiency, as one of the cells firing B, is increased.’ The 
change of the synaptic transmission efficiency acts as a memory for past experiences. 

In a simplified artificial model of a neuron (see Figure 1.3), the synaptic transmission
efficiency is translated into a real number $w_i$ by which an input $x_i$ is multiplied before
entering the neuron cell. The number $w_i$ is called the weight of input $x_i$. The absence
or presence of a train of electrical pulses in a real neural fiber is modelled by a
variable $x_i$ which respectively may have the value zero or one. In that case we will say
that we have a binary artificial neuron representing the ‘one-or-zero’ behaviour of a 
real neuron. 

In reality, neurons may fire up to about 100 pulses in a second. We can model
this gradual change in pulse train frequency by a variable $x_i$ which may have any
value between zero and one. In that case we have a continuous artificial neuron. 

In almost all artificial models of neurons, all inputs $x_i$ are weighted by the synaptic
transmission efficiency and are summed to one number $s = \sum_i x_i w_i$. This weighted
input $s$ determines in some way or another the output value $y$ of the artificial neuron.
In a binary neuron the output y will be one (the neuron is firing) if the weighted 
input exceeds some threshold T, and will be zero (the neuron is silent) if the 
weighted input is below the threshold. In a continuous artificial neuron, the output 
y may be some monotone increasing function (the frequency of the pulse train on 
the outgoing axon is gradually changing depending on s) of the weighted input s. 
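
The two neuron models described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the logistic function used for the continuous neuron is one assumed choice of monotone increasing function, not one prescribed by the text.

```python
import math


def weighted_input(x, w):
    """Weighted sum s = sum_i x_i * w_i of the inputs."""
    return sum(xi * wi for xi, wi in zip(x, w))


def binary_neuron(x, w, threshold):
    """Output 1 (firing) if the weighted input exceeds the threshold, else 0."""
    return 1 if weighted_input(x, w) > threshold else 0


def continuous_neuron(x, w):
    """Output a monotone increasing function of the weighted input s.

    Assumption: the logistic function 1 / (1 + exp(-s)) is used here
    purely as an example of such a function.
    """
    s = weighted_input(x, w)
    return 1.0 / (1.0 + math.exp(-s))


print(binary_neuron([1, 0, 1], [0.5, 0.2, 0.4], threshold=0.8))
print(continuous_neuron([0.2, 0.9], [1.0, 1.0]))
```

The binary neuron only reports whether the threshold is exceeded, whereas the continuous neuron changes its output gradually with s.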

The model of an artificial neuron outlined above was first introduced by the
neurophysiologist Warren McCulloch and the logician Walter Pitts in 1943. In a 
famous paper by the mathematician S. C. Kleene in 1951, it was shown that with 
the artificial neurons introduced by McCulloch and Pitts, one can build a system 
that behaves in the same way as a computer. Although it is important to know that 
artificial neural nets are not inferior in their computation capabilities to computers, 
it is of no practical use to mimic computers. The benefits of an artificial neural 
network are mainly the results of the modifiability of behaviour by changing the
weights $w_i$ in a learning process.

The learning behaviour of artificial neural nets was first treated extensively in a 
book by Frank Rosenblatt in 1962, Principles of Neurodynamics. He introduced a 
learning algorithm by which the weights can be changed such that a desired 
computation was performed. The wave of activity on artificial neural networks in 
the mid-1960s was, however, challenged in 1969 by Marvin Minsky and Seymour 
Papert, who showed in their book Perceptrons that some simple computations 
cannot be done with a one-layer neural net, and doubted that a learning algorithm 
could be found for multi-layer neural networks. At that time many scientists left the 
field of artificial neural networks. 

A second upheaval took place in the mid-1980s when several people found a 
learning algorithm, called the back-propagation algorithm, that could adjust the weights 
in multi-layer neural nets. About that time also new types of neural net with dynamic 
behaviour were introduced. We mention the neural net of Hopfield (1982) (not treated
in this book) and the self-organizing neural net of Kohonen (1982). 

Dynamic neural nets are characterized by feedback: the output of a neuron depends,
after some delay, on its own output because of the fully interconnected structure of
the neural nets. The binary Perceptron we will discuss in Chapter 2 and the continuous
Perceptron of Chapter 3 are not dynamic systems: after learning, the output y(t) of
the neural net will only depend on the actual input x(t) and not on previous inputs.
However, the input-output behaviour F: y(t)= F{x(t)}, learned by the neural net in 
the training phase, will depend on the sample input-output behaviour of the training 
set. 

For a dynamic system the actual output y(t) depends not only on the actual input
x(t) but also on the actual state q(t): y(t) = F{q(t), x(t)}, whereas the state depends on
inputs in the past. There exist neural networks that can learn dynamic behaviour
from sequences of inputs and corresponding sequences of outputs (Veelenturf, 1981).
We will not discuss these networks because they are a typical example of solving
problems by using a neural network, whereas, except for special cases, there are more
efficient methods of finding the solutions with conventional methods (Veelenturf,
1978). This observation leads to the warning that one must be aware that using neural
networks is not a panacea; it is frequently better to have recourse to more conventional
methods.

THE BINARY PERCEPTRON


2.1 Introduction 


The behaviour of an artificial neuron is inspired by the assumed behaviour of a real 
neuron in organic neural networks. A simplified model of a real neuron is composed 
of a cell body or soma, a set of fibers entering the cell body, called the dendrites, and 
one special fiber leaving the soma, called the axon. The dendrites transmit trains of 
electrical pulses towards the soma and the axon conducts trains of pulses away from 
the soma. The axon terminates by branching into many filaments. The filaments end 
in bulb-like structures called synapses that make contact with dendrites or somas of 
other neurons. The transfer of electrical pulses from the final filaments of some axon 
to the dendrites or soma of another neuron depends on the synaptic transmission 
efficiency, represented by the variable w. If the synaptic transmission efficiency is 
positive, the synapse is said to be excitatory; if negative, the synapse is called inhibitory.
The positive or negative transmission efficiency may vary between small and large
values. Only when the sum of ‘synaptic weighted’ incoming pulses is greater than
some threshold is a train of pulses generated by the soma and transmitted by the
axon (see Figure 2.1). In organic neural tissue the pulse frequency may vary from
a few pulses per second up to twenty pulses per second. An additional simplification
is to disregard the frequency of the pulse trains and to consider only the presence,
represented by the number 1, or absence, represented by the number 0, of a pulse
train. This simplified model of a real neuron is called the one-or-zero behaviour of a 
neuron. 

This simplified model of a neuron can easily be simulated by an artificial neuron 
(see Figure 2.2). Dendrites are represented by input lines and a variable $x_i$ represents
the presence [$x_i(t) = 1$] or absence [$x_i(t) = 0$] of a pulse train on fiber $i$ at time $t$. Every
artificial neuron has one output line representing the axon of the neuron, and the
presence or absence of a pulse train at an axon is represented by the value 1 or 0 of
the variable $y(t)$. There will be one special input line with a constant input $x_0 = 1$ and
a weight $w_0$. This constant input, $x_0 = 1$, and the weight $w_0$ realize a threshold equal
to $-w_0$. When the ‘weighted’ sum of incoming signals is greater than the threshold
$T = -w_0$, the output $y$ will become 1 after a delay $\tau$. The input-output behaviour of
an artificial neuron is now specified by:

$$
y(t + \tau) =
\begin{cases}
1 & \text{if } \sum_{i=1}^{n} w_i x_i(t) > -w_0 \\
0 & \text{otherwise}
\end{cases}
$$


Figure 2.2 Artificial model of a neuron


The variable $w_i$ is called the weight of input line $i$ and represents the synaptic
transmission efficiency of the synapse between the final filament of a neuron and the
dendrite $i$ (or the soma) of a particular neuron. The threshold $T = -w_0$, the weights
$w_i$ and the delay $\tau$ are real valued. If there is no feedback in the neural network we
may take $\tau = 0$, and the time dependency of $x_i$ and $y$ can be ignored. So the previous
formulation of the input-output behaviour can be replaced by:

$$
y =
\begin{cases}
1 & \text{if } \sum_{i=1}^{n} w_i x_i > -w_0 \\
0 & \text{otherwise}
\end{cases}
$$


Figure 2.3 Electronic implementation of an artificial neuron


An artificial neuron can easily be implemented in a simple electronic circuit (see
Figure 2.3). Those acquainted with electronics will understand that the transistor will
be open if

$$
\frac{E_1}{R_1} + \frac{E_2}{R_2} > -\frac{U_3}{R_3}
$$

If the voltage $E_1$ represents $x_1$, $E_2$ represents $x_2$, and $-U_3/R_3$ represents the threshold
$-w_0$, then we obtain with $1/R_1 = w_1$ and $1/R_2 = w_2$ that the transistor is open if:

$$
w_1 x_1 + w_2 x_2 > -w_0
$$


Networks composed of layers of interconnected artificial neurons have been studied 
extensively by many authors. The analysis of neural networks is attractive because 
all the building units, the neurons, are the same and the transfer function of such a 
unit is quite simple. More important, however, is that we can alter the behaviour of 
a neuron by changing in a learning process the weights $w_i$ in the input lines. Changing
weights is the artificial counterpart of the adaptation of the synaptic efficiency in real 
organic neural networks. Before examining this learning behaviour of a neural 
network, we consider the ‘zero-or-one behaviour’ of just one single artificial ‘binary’ 
neuron. 

With a single neuron, for example, we can realize some restricted class of predicate
logic. Consider the statement: ‘John is going out for a walk if and only if the sun is
shining or if it is cold and the wind is blowing west.’ The predicate ‘John is going
out for a walk’ is only TRUE if the conditions mentioned are TRUE. Now we can
represent the ‘truth value’ TRUE by the number 1 and FALSE by the number 0.
We represent the truth value of the predicate ‘The sun is shining’ by $x_1$, the truth
value of ‘It is cold’ by $x_2$, the truth value of ‘The wind is blowing west’ by $x_3$ and the
truth value of ‘John is going out for a walk’ by $y$. With this notation we can enumerate
all possible situations in a simple truth table, as shown in Table 2.1.

Table 2.1

x1  x2  x3  y
0   0   0   0
0   0   1   0
0   1   0   0
0   1   1   1
1   0   0   1
1   0   1   1
1   1   0   1
1   1   1   1


Figure 2.4 Neuron illustrating Table 2.1

If we now consider the values of $x_1$, $x_2$ and $x_3$ as the inputs of a single neuron
and $y$ as the output, we can select the weights $w_0$, $w_1$, $w_2$ and $w_3$ in such a way that
the output behaviour of that neuron yields the truth value of the predicate ‘John is
going out for a walk’ (Figure 2.4). Methods for finding the appropriate weights
analytically, or by a learning process, will be discussed later.
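
As an illustration, the following Python sketch checks one set of weights against all eight rows of Table 2.1. The weights $w_0 = -1$, $w_1 = 2$, $w_2 = 1$, $w_3 = 1$ are an assumption: they are one choice that satisfies the table, not necessarily the only one, and the function names are hypothetical.

```python
from itertools import product

# Assumed weights (w0, w1, w2, w3); w0 acts on the constant input x0 = 1.
WEIGHTS = (-1, 2, 1, 1)


def neuron(x1, x2, x3, weights=WEIGHTS):
    """Single binary neuron: output 1 if w0 + w1*x1 + w2*x2 + w3*x3 > 0."""
    w0, w1, w2, w3 = weights
    return 1 if w0 + w1 * x1 + w2 * x2 + w3 * x3 > 0 else 0


def john_walks(sun, cold, wind_west):
    """Truth value of the predicate: sun OR (cold AND wind blowing west)."""
    return 1 if sun or (cold and wind_west) else 0


for x in product((0, 1), repeat=3):
    print(x, "neuron:", neuron(*x), "predicate:", john_walks(*x))
```

With these weights the sun alone outvotes the threshold, while cold and wind must both be present to do so.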

Pioneers in this field of research, like Rosenblatt (1962) and Minsky and Papert
(1969), investigated neural networks with the aim of using such networks mainly for
pattern recognition problems. For this reason they called those networks Perceptrons
(from perception). In honour of Rosenblatt, who used the term first, we will call the
networks discussed in Chapters 2 and 3 Perceptrons.


Figure 2.6 Classification of sixteen patterns by a neural network

As an example we consider a pattern recognition problem. Some pattern is projected
onto a grid of small squares. A variable $x_i$ is assigned to each square. The variable
$x_i$ will have the value 1 if the pattern is covering that particular square, and 0 if the
pattern is not covering that square (see Figure 2.5). The values of the variables $x_i$
constitute the inputs of a binary neural network. The output of the neural network 
will classify patterns as belonging to some predefined class (y= 1) or not (y=0). 

For example, if we have a very small grid of four squares and one single neuron
(see Figure 2.6), one may wish to classify the sixteen different artificial patterns (see
Table 2.2) according to whether patterns of at least two black squares are connected
(i.e. the black squares are adjacent), $y = 1$, or not, $y = 0$.

Table 2.2

x1  x2  x3  x4  y      x1  x2  x3  x4  y
0   0   0   0   0      1   0   0   0   0
0   0   0   1   0      1   0   0   1   0
0   0   1   0   0      1   0   1   0   1
0   0   1   1   1      1   0   1   1   1
0   1   0   0   0      1   1   0   0   1
0   1   0   1   1      1   1   0   1   1
0   1   1   0   0      1   1   1   0   1
0   1   1   1   1      1   1   1   1   1

Although there are many pattern classification problems that can be solved with
a single neuron, we will demonstrate in the next chapter that there exists no set of
weights $w_0$, $w_1$, $w_2$, $w_3$ and $w_4$ such that Table 2.2 is realized by a single-neuron
classifier. It turns out that we need a two-layer neural network with at least two
neurons in the first layer. We can see directly that the problem can also be solved
with four neurons in the first layer (one neuron for detecting that $x_1$ and $x_2$ are both
1, one neuron for detecting that $x_1$ and $x_3$ are both 1, one neuron for detecting that
$x_2$ and $x_4$ are both 1, and one neuron to detect that $x_3$ and $x_4$ are both 1), and one
neuron in a second layer to detect that at least one neuron in the first layer has an output
of 1. One might suspect that a single neuron is not able to solve a classification
problem if there are a great number of input variables. There are, however, problems
with only two input variables that are also not solvable with a single neuron.
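
The four-plus-one construction can be sketched in Python and checked against Table 2.2. The sketch assumes that the four squares are laid out as a 2 x 2 grid with $x_1$, $x_2$ on the top row and $x_3$, $x_4$ on the bottom row (the layout implied by the table), and the thresholds 1.5 and 0.5 are assumed values; other choices work equally well.

```python
def step(z):
    """Step function: 1 if z > 0, else 0."""
    return 1 if z > 0 else 0


# Index pairs of adjacent squares, assuming the grid layout
#   x1 x2
#   x3 x4
ADJACENT_PAIRS = [(0, 1), (0, 2), (1, 3), (2, 3)]


def first_layer(x):
    """Four neurons, each firing only if both squares of one pair are black."""
    return [step(x[i] + x[j] - 1.5) for i, j in ADJACENT_PAIRS]


def second_layer(h):
    """One neuron firing if at least one first-layer neuron fires."""
    return step(sum(h) - 0.5)


def connected(x):
    return second_layer(first_layer(x))


TABLE_2_2 = {
    (0, 0, 0, 0): 0, (0, 0, 0, 1): 0, (0, 0, 1, 0): 0, (0, 0, 1, 1): 1,
    (0, 1, 0, 0): 0, (0, 1, 0, 1): 1, (0, 1, 1, 0): 0, (0, 1, 1, 1): 1,
    (1, 0, 0, 0): 0, (1, 0, 0, 1): 0, (1, 0, 1, 0): 1, (1, 0, 1, 1): 1,
    (1, 1, 0, 0): 1, (1, 1, 0, 1): 1, (1, 1, 1, 0): 1, (1, 1, 1, 1): 1,
}

mismatches = [x for x, y in TABLE_2_2.items() if connected(x) != y]
print("rows not reproduced:", mismatches)
```

The two diagonal patterns (0, 1, 1, 0) and (1, 0, 0, 1) are the only two-square patterns for which no first-layer neuron fires.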

Consider, for example, the Boolean ‘exclusive-or’ function $y = x_1 \oplus x_2$, i.e. the output
of the single neuron must be 1 if and only if $x_1 = 1$ or $x_2 = 1$ but not both. We will
see in the next chapter that we cannot solve this problem with a single neuron but 
we will demonstrate, on the other hand, that any Boolean function can be realized 
with a two-layer Perceptron. This example indicates at the same time a third 
application area for the use of binary Perceptrons: the realization of Boolean functions. 
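
For the exclusive-or function, one possible two-layer realization is sketched below in Python. The weights are an assumed choice made for illustration and are not taken from the text; the general construction is the subject of the next chapter.

```python
def step(z):
    """Step function: 1 if z > 0, else 0."""
    return 1 if z > 0 else 0


def xor_net(x1, x2):
    """Two-layer binary Perceptron for exclusive-or (assumed weights)."""
    h1 = step(x1 - x2 - 0.5)   # fires only for x1 = 1, x2 = 0
    h2 = step(x2 - x1 - 0.5)   # fires only for x1 = 0, x2 = 1
    return step(h1 + h2 - 0.5)  # fires if either first-layer neuron fires


for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_net(x1, x2))
```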

Because we can realize any Boolean function with a binary Perceptron and because 
every neuron can be implemented with an electronic circuit, we have the fourth 
application area: switching circuits. 

In a subsequent section of this chapter we will study two-layer binary neural 
networks composed of interconnected artificial neurons without feedback connections 
between neurons. 

Different Boolean functions can be realized in parallel, e.g. with the two-layer 
neural network given in Figure 2.7. The neural net of Figure 2.7 classifies simple 
patterns consisting of three pixel points in three classes as specified by Table 2.3. 

If a pattern $p = \langle p_1 p_2 p_3 \rangle$ is a member of the class $K_1 = \{\langle 000 \rangle, \langle 001 \rangle, \langle 100 \rangle, \langle 111 \rangle\}$,
then $y_1 = 1$ and $y_2 = 0$. If the pattern is a member of the class $K_2 = \{\langle 010 \rangle, \langle 011 \rangle\}$,
then $y_1 = 0$ and $y_2 = 1$. If the pattern is not a member of $K_1$ or $K_2$, then $y_1 = 0$ and $y_2 = 0$.


Figure 2.7 Neural net for classification with two-dimensional output


2.2 The performance of a single-neuron binary Perceptron 


In the previous section we saw that a single neuron performs a kind of ‘weighted
voting’ on variables $x_i$: the output $y$ of the neuron will be 1 if and only if
$w_1 x_1 + w_2 x_2 + \cdots + w_n x_n$ is greater than some threshold $T$.


Example 2.1 


Consider the balance of Figure 2.8. At equally spaced points there might
be objects with some weight $g_k$ at the balance pole. At the left-hand side there is one
fixed weight $g_0$ attached to the balance at a unit distance from the suspension point.
We use the variable $x_k$ to indicate whether ($x_k = 1$) or not ($x_k = 0$) there is an object
placed at distance $k$ from the point of suspension.

Now ‘The balance will tip to the right’ if and only if:

$$
\sum_k k \, g_k x_k > g_0
$$

or, with $k g_k$ replaced by $w_k$ and $g_0 = T$, and when the predicate ‘The balance will tip to
the right’ is replaced by the binary variable $y$, we obtain:

$$
y = 1 \text{ if and only if } \sum_k w_k x_k > T
$$


Figure 2.8 The mechanical equivalent of a threshold function
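
The correspondence between the balance and a threshold function can be checked numerically. In the Python sketch below, the object weights and the fixed weight are made-up illustrative numbers and the function names are hypothetical; the only step taken from the example is the substitution $w_k = k g_k$, $T = g_0$.

```python
from itertools import product


def balance_tips_right(x, g, g0):
    """x[k-1] = 1 if an object of weight g[k-1] hangs at distance k (right side).

    The balance tips to the right when the total moment on the right
    exceeds the moment g0 * 1 of the fixed weight at unit distance on the left.
    """
    moment_right = sum(k * gk * xk for k, (gk, xk) in enumerate(zip(g, x), start=1))
    return 1 if moment_right > g0 else 0


def threshold_function(x, w, T):
    """y = 1 if and only if sum_k w_k * x_k > T."""
    return 1 if sum(wk * xk for wk, xk in zip(w, x)) > T else 0


G = [1, 2, 1]   # illustrative object weights at distances 1, 2, 3
G0 = 4          # illustrative fixed weight on the left
W = [k * gk for k, gk in enumerate(G, start=1)]  # w_k = k * g_k

for x in product((0, 1), repeat=3):
    print(x, balance_tips_right(x, G, G0), threshold_function(x, W, G0))
```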


We will now investigate the properties of threshold functions like the one used
above. We define first: a function $y = f(x_1, x_2, \ldots, x_n)$ is called a binary linear threshold
function with respect to the binary-valued variables $x_1, x_2, \ldots, x_n$ if there exist a number
$T$ and a set of numbers $\{w_1, w_2, \ldots, w_n\}$ such that $y = 1$ if and only if $\sum_i w_i x_i > T$. We
usually will drop the adjective ‘binary’ and, although important, we will frequently
drop the phrase ‘with respect to the binary-valued variables $x_1, x_2, \ldots, x_n$’.

With the use of the step function $S(z)$ defined by $S(z) = 1$ if $z > 0$ and $S(z) = 0$ if $z \le 0$,
we can equivalently say that $y = f(x_1, x_2, \ldots, x_n)$ is a linear threshold function if
$y = S(\sum_i w_i x_i - T)$.

If the weights $w_i$ constitute the components of a so-called weight vector $\mathbf{w}$ and the
variables $x_i$ are the components of the input vector $\mathbf{x}$, we can also write a linear
threshold function as $y = S(\mathbf{w}^t \mathbf{x} - T)$, with $\mathbf{w}^t$ the transpose of the vector $\mathbf{w}$.

Note that a linear threshold function is not a linear function in the ordinary sense,
because for a linear threshold function we have in general $f(a\mathbf{x}) \neq a f(\mathbf{x})$.

In Section 2.1 we have demonstrated that a linear threshold
function can be realized by a single-neuron binary Perceptron with the threshold
$T = -w_0$ realized by a constant input $x_0 = 1$ and a corresponding weight $w_0$. If we
introduce a so-called extended weight vector $\mathbf{W}$ with $\mathbf{W}^t = [w_0, w_1, w_2, \ldots, w_n]$, and a
so-called extended input vector $\mathbf{X}$ with $\mathbf{X}^t = [1, x_1, x_2, \ldots, x_n]$, then we can compactly
write a linear threshold function realized by a single-neuron Perceptron as $y = S(\mathbf{W}^t \mathbf{X})$.
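
The equivalence of the two notations is easy to verify numerically. The Python sketch below uses hypothetical function names and arbitrary illustrative weights; it compares $S(\mathbf{w}^t \mathbf{x} - T)$ with the extended form $S(\mathbf{W}^t \mathbf{X})$, where $w_0 = -T$.

```python
from itertools import product


def step(z):
    """Step function S(z): 1 if z > 0, else 0."""
    return 1 if z > 0 else 0


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def threshold_form(x, w, T):
    """y = S(w^t x - T)."""
    return step(dot(w, x) - T)


def extended_form(x, w, T):
    """y = S(W^t X) with W = [w0, w1, ..., wn], w0 = -T and X = [1, x1, ..., xn]."""
    W = [-T] + list(w)
    X = [1] + list(x)
    return step(dot(W, X))


w, T = (2, 1, 1), 1  # illustrative values only
same = all(threshold_form(x, w, T) == extended_form(x, w, T)
           for x in product((0, 1), repeat=3))
print("both forms agree on all inputs:", same)
```

Absorbing the threshold into the weight vector means that a learning rule only has to adjust weights; no separate treatment of $T$ is needed.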

Now we concentrate on the class of logical functions that can be realized by a 
single-neuron binary Perceptron. If $y = f(\mathbf{x})$ is a logical function of two variables $x_1$
and $x_2$ we can have sixteen different functions. (NB: there are $2^2 = 4$ different arguments
and for each argument the function value can be either 1 or 0.) Of those sixteen
functions, fourteen can be realized by a single-neuron binary Perceptron. The 
two functions that cannot be realized are specified by Table 2.3. The first function 
is called the exclusive-or function and the second the identity function. 
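
The count of fourteen can be reproduced with a small brute-force search in Python. This is only a sketch: the search is restricted to integer weights between -2 and 2, which is an assumed range that happens to be sufficient for the fourteen realizable functions. A failed search over a finite grid is of course not a proof that no weights exist; the proof for the exclusive-or function follows in the text.

```python
from itertools import product

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
WEIGHT_GRID = range(-2, 3)  # assumed search range for w0, w1, w2


def neuron(w0, w1, w2, x1, x2):
    """Single-neuron binary Perceptron: 1 if w0 + w1*x1 + w2*x2 > 0."""
    return 1 if w0 + w1 * x1 + w2 * x2 > 0 else 0


def find_weights(truth_table):
    """Return some (w0, w1, w2) on the grid realizing the table, or None."""
    for w0, w1, w2 in product(WEIGHT_GRID, repeat=3):
        if all(neuron(w0, w1, w2, x1, x2) == y
               for (x1, x2), y in zip(INPUTS, truth_table)):
            return (w0, w1, w2)
    return None


realizable = {}
unrealizable = []
for truth_table in product((0, 1), repeat=4):  # outputs for 00, 01, 10, 11
    weights = find_weights(truth_table)
    if weights is None:
        unrealizable.append(truth_table)
    else:
        realizable[truth_table] = weights

print(len(realizable), "functions realized; no weights found for:", unrealizable)
```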

We will now prove that the first function is not a linear threshold function and
thus cannot be realized by a single-neuron binary Perceptron.

Table 2.3

    exclusive-or        identity
    x1  x2  y           x1  x2  y
    0   0   0           0   0   1
    0   1   1           0   1   0
    1   0   1           1   0   0
    1   1   0           1   1   1

Assume, however, that we can realize the exclusive-or function; then there must be
weights $w_0$, $w_1$ and $w_2$ such that:

For the first argument [0, 0] we must have:
    $w_0+w_1x_1+w_2x_2\le 0$, thus: $w_0\le 0$
For the second argument [0, 1] we must have:
    $w_0+w_1x_1+w_2x_2>0$, thus: $w_0+w_2>0$
For the third argument [1, 0] we must have:
    $w_0+w_1x_1+w_2x_2>0$, thus: $w_0+w_1>0$
For the fourth argument [1, 1] we must have:
    $w_0+w_1x_1+w_2x_2\le 0$, thus: $w_0+w_1+w_2\le 0$

The four inequalities for the weights cannot be satisfied simultaneously: adding the
second and the third gives $2w_0+w_1+w_2>0$, and because $w_0\le 0$ this implies
$w_0+w_1+w_2>0$, which contradicts the fourth. This completes our proof.
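
The count of fourteen realizable functions can also be checked by brute force. The following Python sketch is not part of the original text. It assumes that integer weights in the range -2..2 are sufficient for two inputs (an illustrative search range), and the names `step`, `ARGS` and `realizable` are hypothetical. It collects every truth table produced by $S(w_0+w_1x_1+w_2x_2)$.

```python
from itertools import product

def step(z):
    """Step function S(z): 1 if z > 0, otherwise 0."""
    return 1 if z > 0 else 0

# The four arguments of a two-input logical function, in truth-table order.
ARGS = [(0, 0), (0, 1), (1, 0), (1, 1)]

# Collect the truth tables of all functions S(w0 + w1*x1 + w2*x2)
# with integer weights in -2..2 (an assumed, illustrative search range).
realizable = set()
for w0, w1, w2 in product(range(-2, 3), repeat=3):
    table = tuple(step(w0 + w1 * x1 + w2 * x2) for x1, x2 in ARGS)
    realizable.add(table)

# Every truth table that no weight combination produced.
all_tables = set(product((0, 1), repeat=4))
missing = sorted(all_tables - realizable)
print(len(realizable), "realizable; missing:", missing)
```

The two tables that never appear are (0, 1, 1, 0) and (1, 0, 0, 1), i.e. the exclusive-or and the identity function of Table 2.3.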

Our conclusion that the exclusive-or function cannot be realized by a single-neuron
Perceptron also becomes clear when we consider the realization of the exclusive-or
function as a classification problem. We have two classes of points in a
two-dimensional input space (see Figure 2.9). For one class of points {(0, 1), (1, 0)}
the output of the neuron must be 1, and for the other class {(0, 0), (1, 1)} the output
must be 0. Thus for the first class we must have $w_0+w_1x_1+w_2x_2>0$, and for the
second class we must have $w_0+w_1x_1+w_2x_2\le 0$. The set of points for which
$w_0+w_1x_1+w_2x_2=0$ represents a separating line in the two-dimensional input space.
On one side of this line we will have for every point $(x_1,x_2)$ that $w_0+w_1x_1+w_2x_2>0$
and thus the output of the neuron will be equal to 1. On the other side of the line
we will have $w_0+w_1x_1+w_2x_2<0$ and the output of the neuron will be 0. Now one
can easily check that we cannot locate a line between the two sets of points that
must be separated; thus there exists no solution for our classification problem.
Fortunately we can show in a later section that the problem can be solved with a
two-layer network.

Figure 2.9 Exclusive-or as a classification problem

Table 2.4

         x1  x2  x3  y
    1    0   0   0   0
    2    0   0   1   0
    3    0   1   0   0
    4    0   1   1   1
    5    1   0   0   1
    6    1   0   1   1
    7    1   1   0   1
    8    1   1   1   1

We will now discuss the more general situation of the realization of logical (or
binary) functions with n arguments with a single-neuron Perceptron. There are $2^{2^n}$
different functions with n arguments. A large number of these functions are not linear
threshold functions with respect to the variables $x_1,x_2,\ldots,x_n$ and thus cannot be
written as $y=S(\sum_i w_i x_i - T)$; hence they cannot be realized with a single-neuron binary
Perceptron. We will give an example of how to find the linear threshold function (if
it exists) given the truth table of a logical function with three arguments.


Example 2.2 


Let a function be specified by the truth table given in Table 2.4 (see also Figure 2.4). We
want to obtain a function of the form $y=S(\sum_i w_i x_i - T)$, or with the threshold T replaced
by the weight $-w_0$ we want to find the expression $y=S(w_0+w_1x_1+w_2x_2+w_3x_3)$.

For the successive arguments we must have the following:

    1. $w_0\le 0$
    2. $w_0+w_3\le 0$
    3. $w_0+w_2\le 0$
    4. $w_0+w_2+w_3>0$
    5. $w_0+w_1>0$
    6. $w_0+w_1+w_3>0$
    7. $w_0+w_1+w_2>0$
    8. $w_0+w_1+w_2+w_3>0$

From (1) and (4) we conclude $w_2+w_3>0$. Take $w_2=1$ and $w_3=1$; then from (2) and
(3) we conclude $w_0\le -1$. Take $w_0=-1$. From (5) we infer $w_1>1$; take $w_1=2$. With this
selection of weights all inequalities are satisfied and we obtain the following linear
threshold function: $y=S(-1+2x_1+x_2+x_3)$.


From this example it becomes clear that there exists, within the bounds given by
the set of inequalities, a certain amount of freedom to select the weights. For instance
we could select as well: $w_0=-2$, $w_1=4$, $w_2=2$ and $w_3=2$; another selection could
be: $w_0=-1$, $w_1=4$, $w_2=1$ and $w_3=1$. We will return to this topic in Section 2.3.
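
The three weight selections mentioned for Example 2.2 can be checked against Table 2.4 directly. The snippet below is an illustrative sketch rather than part of the original text; the names `TABLE_2_4`, `realizes` and `selections` are hypothetical.

```python
from itertools import product

def step(z):
    """Step function S(z): 1 if z > 0, otherwise 0."""
    return 1 if z > 0 else 0

# Target outputs of Table 2.4 for (x1, x2, x3) = 000, 001, ..., 111.
TABLE_2_4 = [0, 0, 0, 1, 1, 1, 1, 1]

def realizes(w0, w1, w2, w3, targets):
    """True if S(w0 + w1*x1 + w2*x2 + w3*x3) reproduces the target column."""
    outputs = [step(w0 + w1 * x1 + w2 * x2 + w3 * x3)
               for x1, x2, x3 in product((0, 1), repeat=3)]
    return outputs == targets

# The weight selections (w0, w1, w2, w3) named in the text.
selections = [(-1, 2, 1, 1), (-2, 4, 2, 2), (-1, 4, 1, 1)]
results = [realizes(*w, TABLE_2_4) for w in selections]
print(results)
```

All three selections reproduce the truth table, which illustrates the freedom left by the inequalities.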

As stated before, not all logical functions are linear threshold functions with respect
to the variables $x_1,x_2,\ldots,x_n$. It is not easy to determine whether a given function is
a linear threshold function or not. At present there is only one more or less practical
method by which this can be done, and that is by determining whether or not the
set of inequalities associated with the logical function contains a contradiction.


Example 2.3

Consider the logical function $y=f(x_1,x_2,x_3)$ defined by y=1 if and only if the
number of 1s in the argument is odd. This function is known as the parity
problem and is specified in Table 2.5.

Table 2.5

         x1  x2  x3  y
    1    0   0   0   0
    2    0   0   1   1
    3    0   1   0   1
    4    0   1   1   0
    5    1   0   0   1
    6    1   0   1   0
    7    1   1   0   0
    8    1   1   1   1

We want to find an expression of the form: $y=S(w_0+w_1x_1+w_2x_2+w_3x_3)$. For the
successive arguments we must have the following:

    1. $w_0\le 0$
    2. $w_0+w_3>0$
    3. $w_0+w_2>0$
    4. $w_0+w_2+w_3\le 0$
    5. $w_0+w_1>0$
    6. $w_0+w_1+w_3\le 0$
    7. $w_0+w_1+w_2\le 0$
    8. $w_0+w_1+w_2+w_3>0$

In this case we can see immediately that the parity problem cannot be solved with
a linear threshold function, because from (1) and (2) we conclude that $w_3>0$ and
from (3) and (4) we conclude that $w_3<0$; hence we have a contradiction.
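
Writing down the inequalities is mechanical: each argument contributes the sum of $w_0$ and the weights of the inputs equal to 1, with "> 0" when the target is 1 and "<= 0" when it is 0. The following sketch (an illustration, with the hypothetical helper name `inequalities`) generates the list for the parity function of Table 2.5.

```python
from itertools import product

def inequalities(f, n):
    """Return one inequality (as a string) per argument of the n-input function f."""
    lines = []
    for x in product((0, 1), repeat=n):
        # w0 always contributes; w_i contributes when x_i equals 1.
        terms = ["w0"] + [f"w{i + 1}" for i, xi in enumerate(x) if xi == 1]
        relation = "> 0" if f(*x) == 1 else "<= 0"
        lines.append(" + ".join(terms) + " " + relation)
    return lines

def parity(x1, x2, x3):
    """Output 1 if and only if the number of 1s in the argument is odd."""
    return (x1 + x2 + x3) % 2

for number, line in enumerate(inequalities(parity, 3), start=1):
    print(number, line)
```

Deciding whether the generated set contains a contradiction is still left to the reader (or to a linear-programming solver); the sketch only automates the bookkeeping.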


In the general case of n variables we have to investigate $2^n$ inequalities. We can
save ourselves a great deal of effort if we eliminate redundant inequalities and simplify
expressions by using the following set of properties.

Properties of inequalities

    1. $a>0$ and $b>0 \Rightarrow a+b>0$
    2. $a<0$ and $b<0 \Rightarrow a+b<0$
    3. $a>0$ and $a+b<0 \Rightarrow b<0$
    4. $a<0$ and $a+b>0 \Rightarrow b>0$
    5. $a+b>0$ and $a+c<0 \Rightarrow b>c$
    6. $\sum a_i>0$ and $\lambda>0 \Rightarrow \sum\lambda a_i>0$
    7. $\sum a_i<0$ and $\lambda>0 \Rightarrow \sum\lambda a_i<0$


From rules (1) and (2) we can derive a property that we can sometimes use to check
whether a logical function can be a linear threshold function or not without writing
down the set of all inequalities. Let a in rule (1) represent the sum of the
weights associated with some input vector $x_i$ with $f(x_i)=1$. Thus the first inequality
in rule (1) with $a=\hat w\cdot\hat x_i>0$ must hold. Let b in rule (1) be the sum of weights associated
with some vector $x_j$ with $f(x_j)=1$. Thus the second inequality in rule (1) with
$b=\hat w\cdot\hat x_j>0$ must hold. According to rule (1) we must have $\hat w\cdot\hat x_i+\hat w\cdot\hat x_j>0$. Assume
we have for the zero vector 0: $f(0)=0$. This implies $w_0\le 0$ and thus
$(-w_0)+\hat w\cdot\hat x_i+\hat w\cdot\hat x_j>0$. Let $x_i$ and $x_j$ be vectors with no 1s in the same position;
we write $x_i\cap x_j=0$. Let z be a vector obtained from the two input vectors $x_i$ and $x_j$
such that $z_k=1$ if $x_{ik}=1$ or $x_{jk}=1$, and $z_k=0$ otherwise; we write $z=x_i\cup x_j$. Vector z
has the corresponding inner product $\hat w\cdot\hat z=(-w_0)+\hat w\cdot\hat x_i+\hat w\cdot\hat x_j$, because $w_0$ is
counted twice in the sum $\hat w\cdot\hat x_i+\hat w\cdot\hat x_j$. Because $(-w_0)+\hat w\cdot\hat x_i+\hat w\cdot\hat x_j>0$ we must have
$f(z)=1$. The same kind of reasoning holds if $f(x_i)=0$, $f(x_j)=0$ and $f(0)=1$. In that
case $f(z)=0$ must hold. Thus we can write the consistency property of the binary linear
threshold function: If the logical function $y=f(x)$ is a binary linear threshold function
with respect to $x_1,x_2,\ldots,x_n$, and $f(x_i)=u$, $f(x_j)=u$ and $f(0)=\bar u$ (the complement of u),
with $z=x_i\cup x_j$ and $x_i\cap x_j=0$, and u=1 or u=0, then $f(z)=u$.

Example 2.4

For the identity function we have Table 2.6.

Table 2.6

    x1  x2  y
    0   0   1
    0   1   0
    1   0   0
    1   1   1

From Table 2.6 we find:

    $f(0,0)=1$ and thus $w_0>0$
    $f(0,1)=0$ and thus $w_0+w_2\le 0$
    $f(1,0)=0$ and thus $w_0+w_1\le 0$

For f to be a linear threshold function we must have, by the consistency property,
$f(1,1)=0$, which implies $w_0+w_1+w_2\le 0$. However, $f(1,1)=1$, and thus f cannot be a
linear threshold function with respect to $x_1$ and $x_2$.
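
The consistency property lends itself to an automatic check. The sketch below is illustrative only; the function name `consistency_violations` and the example functions are assumptions, not part of the original text. It lists every pair of disjoint vectors whose union violates the property. An empty list does not prove that a function is a linear threshold function, since the property is only a necessary condition.

```python
from itertools import combinations, product

def consistency_violations(f, n):
    """Return (x_i, x_j, z) triples that violate the consistency property."""
    u = 1 - f(*((0,) * n))  # the property concerns vectors with f(x) = u != f(0)
    violations = []
    for xi, xj in combinations(product((0, 1), repeat=n), 2):
        disjoint = all(a * b == 0 for a, b in zip(xi, xj))
        if disjoint and f(*xi) == u and f(*xj) == u:
            z = tuple(max(a, b) for a, b in zip(xi, xj))  # z = x_i union x_j
            if f(*z) != u:
                violations.append((xi, xj, z))
    return violations

def identity(x1, x2):
    return 1 if x1 == x2 else 0

def parity(x1, x2, x3):
    return (x1 + x2 + x3) % 2

def example_2_2(x1, x2, x3):
    return 1 if -1 + 2 * x1 + x2 + x3 > 0 else 0

print(consistency_violations(identity, 2))
print(len(consistency_violations(parity, 3)))
print(consistency_violations(example_2_2, 3))
```

For the identity function the single violation is the pair (0, 1), (1, 0) with union (1, 1), which is exactly the argument of Example 2.4.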


Many logical functions of n arguments cannot be realized by a single-neuron
Perceptron. This also becomes clear when we consider the determination of a logical
function as a classification problem. We have two classes of points in an n-dimensional
input space (see Figure 2.10 for the parity problem as presented in Example 2.3). For
one class of points A = {(0, 0, 1), (0, 1, 0), (1, 0, 0), (1, 1, 1)} the number of 1s is odd and
the output of the neuron must be 1, and for the other class B = {(0, 0, 0), (0, 1, 1), (1, 1, 0),
(1, 0, 1)} the number of 1s is even and the output must be 0. Thus for the first class
we must have $w_0+w_1x_1+w_2x_2+w_3x_3>0$ and for the second class we must have
$w_0+w_1x_1+w_2x_2+w_3x_3\le 0$. The set of points $(x_1,x_2,x_3)$ for which
$w_0+w_1x_1+w_2x_2+w_3x_3=0$ represents a two-dimensional separating plane H in the
three-dimensional input space. On one side of this plane H we must have for every
point $(x_1,x_2,x_3)$ that $w_0+w_1x_1+w_2x_2+w_3x_3>0$ and thus the output of the neuron
will become equal to 1. On the other side of the plane H we want to have
$w_0+w_1x_1+w_2x_2+w_3x_3<0$ and the output of the neuron must be 0. Now one can
easily check that we cannot locate a plane between the two sets of points that must
be separated; thus there exists no solution for our classification problem and
the parity function cannot be represented by a linear threshold function. Or
equivalently one can say: the two sets of points are not linearly separable.

In the n-dimensional case we must have an (n-1)-dimensional separating hyperplane
H in the n-dimensional input space X if the logical function is a linear threshold
function with respect to $x_1,x_2,\ldots,x_n$. The separating hyperplane is defined by
$w_1x_1+w_2x_2+\cdots+w_nx_n=-w_0$. The weight vector $w=[w_1,w_2,\ldots,w_n]$ is orthogonal
to the separating hyperplane. This becomes clear when we take two n-dimensional
input vectors $x_1$ and $x_2$ located on the hyperplane. Figure 2.11 shows the
two-dimensional case. For these vectors $x_1$ and $x_2$ we have that $w^t(x_1-x_2)=0$ and
thus w and the hyperplane are orthogonal.

Figure 2.11 One-dimensional separating hyperplane H in a two-dimensional space

The separating hyperplane H divides the n-dimensional input space X into two
half-spaces, the region $X^+$ where $\sum_i w_i x_i>-w_0$ (i.e. y=1) and the region $X^-$ where
$\sum_i w_i x_i\le -w_0$ (i.e. y=0). Since a vector x in the region $X^+$ will give $w^t x>-w_0$, the
weight vector w points into the region $X^+$. It is said that x is on the positive side of
the hyperplane H if x is in $X^+$, and x is on the negative side of H if x is in $X^-$.

Figure 2.12 Decomposition of the input vector x

The distance d from the origin to the hyperplane H is equal to the projection of
a vector x in H (i.e. $w^t x=-w_0$) on the unit vector $w/|w|$ normal to H. Thus $d=w^t x/|w|$,
or with $w^t x=-w_0$ we find for the distance along w from the origin to the hyperplane:

    $d=\frac{-w_0}{|w|}$

The distance $\delta$ from the hyperplane H to an input vector x is proportional to
$w^t x+w_0$. The easiest way to see this is by decomposing the vector x into a component
in the direction of w and another component $x_p$ orthogonal to w (see Figure 2.12):

    $x=x_p+\lambda\frac{w}{|w|}$

By forming the inner product $w^t x$ and noting that $w^t x_p=0$ we obtain for the length
of the component of x along w:

    $\lambda=\frac{w^t x}{|w|}$

By subtraction of the distance d from the origin to the hyperplane we obtain for
the distance $\delta$ from the hyperplane to x:

    $\delta=\frac{w^t x+w_0}{|w|}$
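
The distance formula translates directly into a few lines of Python. This is a minimal sketch with illustrative values; the function name `signed_distance` and the chosen weights are assumptions. A positive result means that x lies on the positive side of H.

```python
import math

def signed_distance(w, w0, x):
    """Signed distance (w.x + w0) / |w| from the hyperplane w.x = -w0 to the point x."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + w0) / norm

# Illustrative hyperplane 3*x1 + 4*x2 = 5, so |w| = 5 and d = -w0/|w| = 1.
w, w0 = (3.0, 4.0), -5.0
at_origin = signed_distance(w, w0, (0.0, 0.0))  # origin lies on the negative side
at_point = signed_distance(w, w0, (3.0, 4.0))   # (9 + 16 - 5) / 5
print(at_origin, at_point)
```

For the origin the result is $w_0/|w|=-d$, consistent with the distance d derived above.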
We finally mention that one can prove (see Cover, 1965) that in the n-dimensional
case the number of linear threshold functions is at most:

    $C(2^n,n)=2\sum_{i=0}^{n}\binom{2^n-1}{i}$

For n=2 this gives 2(1+3+3)=14, in agreement with the fourteen functions found above.
For partial functions, $2^n$ must be replaced by the number of samples.


2.3 Equivalent linear threshold functions 


This section on equivalent linear threshold functions can be omitted on a first reading 
of this book. For a more profound understanding of the learning capabilities of a 
binary Perceptron, however, one must know that in general a binary Perceptron 
yields an infinite number of solutions to the same problem. 

In the next chapter we will give a learning procedure that finds, from a set of
samples of a logical function, a single-neuron binary Perceptron that will realize that
logical function, if the logical function is a linear threshold function with respect to
the variables $x_1,x_2,\ldots,x_n$. It will turn out that the weights of the linear threshold
function, obtained after learning by the neuron, will depend on the particular sequence
of applied samples. Different threshold functions can, however, realize the same logical
function; they are known as equivalent linear threshold functions.

In addition, it is worth determining which features of linear threshold functions
are essential and which are arbitrary. It frequently occurs that the descriptions of
linear threshold functions are different, i.e. the weights $w_i$ are different, whereas the
realized logical function is the same. We will now give four different simple theorems
on equivalent linear threshold functions. The first theorem is important, though its
proof is trivial.


Theorem 2.1 


If $y=S(\sum_i w_i x_i+w_0)$ is a linear threshold function, then
$y'=S(\sum_i\lambda w_i x_i+\lambda w_0)$ with $\lambda>0$ is an equivalent linear threshold function.


Example 2.5

Let $y=S(x_1+x_2-x_3)$ and thus $w_1=1$, $w_2=1$ and $w_3=-1$; then with $\lambda=2$ the linear
threshold function $y'=S(2x_1+2x_2-2x_3)$ is equivalent to y.


Theorem 2.2a 


If $y=S(\sum_i w_i x_i+w_0)$ is a linear threshold function and $y'=\sum_i w'_i x_i+w'_0$ is a linear
function with respect to the binary variables $x_1,x_2,\ldots,x_n$ such that if $y'>0$
then y=1, and if $y'=0$ then y=1 or y=0, and if $y'<0$ then y=0, then
$y''=S(\sum_i(w_i+\lambda w'_i)x_i+w_0+\lambda w'_0)$ with $\lambda\ge 0$ is equivalent to $y=S(\sum_i w_i x_i+w_0)$.

Table 2.7

    x1  x2  x3  y   y'  y''
    0   0   0   0   -1  0
    0   0   1   0   -1  0
    0   1   0   1   0   1
    0   1   1   0   0   0
    1   0   0   1   0   1
    1   0   1   0   0   0
    1   1   0   1   1   1
    1   1   1   1   1   1


Proof 


Because we add to the value of the argument of the step function in y a 
positive or zero value when y=1 and we add to the value of the argument of the 
step function in y a negative or zero value when y=0, it will be clear that y =y” for 
all the values of the argument. QED 


Example 2.6 


Let $y=S(x_1+x_2-x_3)$ as specified in Table 2.7. Let $y'=x_1+x_2-1$ as specified in
column 5 of Table 2.7. The conditions on $y'$ are satisfied, so the linear
threshold function $y''=S(3x_1+3x_2-x_3-2)$ (with $\lambda=2$) is equivalent to y.
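
Because the inputs are binary, the equivalence of two threshold functions can always be verified by comparing their truth tables. The sketch below is an illustration (the helper name `truth_table` is hypothetical); it does this for the functions of Examples 2.5 and 2.6.

```python
from itertools import product

def step(z):
    """Step function S(z): 1 if z > 0, otherwise 0."""
    return 1 if z > 0 else 0

def truth_table(w, w0):
    """Outputs of S(sum_i w_i*x_i + w0) for all binary arguments, in table order."""
    return [step(sum(wi * xi for wi, xi in zip(w, x)) + w0)
            for x in product((0, 1), repeat=len(w))]

y = truth_table((1, 1, -1), 0)         # y   = S(x1 + x2 - x3)
scaled = truth_table((2, 2, -2), 0)    # Example 2.5: all weights multiplied by 2
shifted = truth_table((3, 3, -1), -2)  # Example 2.6: y'' = S(3x1 + 3x2 - x3 - 2)
print(y, y == scaled == shifted)
```

The three truth tables coincide, so the three descriptions realize the same logical function.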

The trivial complement of the previous theorem is as follows: 


Theorem 2.2b 


If $y=S(\sum_i w_i x_i+w_0)$ is a linear threshold function and $y'=\sum_i w'_i x_i+w'_0$ is a linear
function with respect to the binary variables $x_1,x_2,\ldots,x_n$ such that if $y'<0$
then y=1, and if $y'=0$ then y=1 or y=0, and if $y'>0$ then y=0, then
$y''=S(\sum_i(w_i+\lambda w'_i)x_i+w_0+\lambda w'_0)$ with $\lambda\le 0$ is equivalent to $y=S(\sum_i w_i x_i+w_0)$.


For the next theorem we need some auxiliary concepts and lemmas. We first define
$X^+$ as the set of argument values of a logical function for which y=1; in the same
way we have $X^-$ as the set of argument values for which y=0.


Lemma 2.1


If $y=S(\sum_i w_i x_i+w_0)$ is a linear threshold function, then $y'=S(\sum_i w_i x_i+w_0-\Delta^+)$ with
$0<\Delta^+<\min(\sum_i w_i x_i+w_0)$ over $X^+$ is an equivalent linear threshold function. If
$y''=S(\sum_i w_i x_i+w_0-\Delta^-)$ with $0\ge\Delta^-\ge\max(\sum_i w_i x_i+w_0)$ over $X^-$, then $y''$ is also an
equivalent linear threshold function.


Proof


When we subtract from the argument of the step function in y a positive
constant $\Delta^+$ less than $\min(\sum_i w_i x_i+w_0)$ over $X^+$, then the argument of the
step function will remain positive for all elements of $X^+$ and the argument of the step
function will be negative or zero for all elements of $X^-$; thus y will remain the same
for all elements. If we add to the argument a positive constant smaller than
or equal to $-\max(\sum_i w_i x_i+w_0)$ over $X^-$, then y will also remain the same for all
values of the argument. QED


Example 2.7


The linear threshold function $y=S(x_1+x_2-x_3)$ is equivalent to $y'=S(x_1+x_2-x_3-0.5)$
because $\Delta^+=0.5$ and $0<0.5<\min(\sum_i w_i x_i+w_0)=1$ over $X^+$ (see Table 2.7). Note:
$\Delta^-=0$.

Lemma 2.2


If $y=S(\sum_i w_i x_i+w_0)$ is a linear threshold function with respect to $x_1,x_2,\ldots,x_n$, then
$y'=S(\sum_i(w_i-\delta_i)x_i+w_0)$ is an equivalent linear threshold function if
$0<\delta_i<(1/n)\min(\sum_i w_i x_i+w_0)$ over $X^+$ or if $0\ge\delta_i\ge(1/n)\max(\sum_i w_i x_i+w_0)$ over $X^-$.


Proof


If, in the worst case, for all $x_i$ we have $x_i=1$, we subtract in the first case from
the argument of the step function in y the constant $n\delta_i$ with $0<n\delta_i<\min(\sum_i w_i x_i+w_0)$
over $X^+$. According to Lemma 2.1, we obtain in that case an equivalent linear
threshold function. The same holds if $0\ge\delta_i\ge(1/n)\max(\sum_i w_i x_i+w_0)$ over $X^-$. QED


Example 2.8


The linear threshold function $y=S(x_1+x_2-x_3)$ is equivalent to
$y'=S[(1-1/6)x_1+(1-1/6)x_2-(1+1/6)x_3]$ because $\delta_i=1/6$ and
$0<1/6<(1/3)\min(\sum_i w_i x_i+w_0)$ over $X^+$.



Theorem 2.3 


If $y=S(\sum_i w_i x_i+w_0)$ is a linear threshold function, then there exists an equivalent
linear threshold function $y'=S(\sum_i w'_i x_i+w'_0)$ such that all weights are integers.


Proof


If all weights in y are rational numbers, we can form the product D of all
denominators of the weights and multiply all weights by D. All weights will become
integers and the obtained threshold function $y'$ is, according to Theorem 2.1,
equivalent to y. If a weight $w_i$ is an irrational number, we can replace weight $w_i$, according
to Lemma 2.2, by a rational weight $w'_i$ such that $w'_i=w_i-\delta_i$, and rational weights can
be replaced by integer weights. This completes our proof. QED
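
The rational case of the proof is easy to try out with exact arithmetic. The sketch below is an illustration; the weight values and the helper names `integer_weights` and `truth_table` are assumptions. It multiplies all extended weights by the product D of their denominators and compares truth tables.

```python
from fractions import Fraction
from itertools import product
from math import prod

def step(z):
    """Step function S(z): 1 if z > 0, otherwise 0."""
    return 1 if z > 0 else 0

def integer_weights(weights):
    """Multiply all weights by the product D of their denominators (Theorem 2.1)."""
    d = prod(w.denominator for w in weights)
    return [int(w * d) for w in weights]

def truth_table(weights):
    """Truth table of S(w0 + sum_i w_i*x_i) for extended weights [w0, w1, ..., wn]."""
    w0, rest = weights[0], weights[1:]
    return [step(w0 + sum(wi * xi for wi, xi in zip(rest, x)))
            for x in product((0, 1), repeat=len(rest))]

# Illustrative rational weights [w0, w1, w2, w3]; here D = 2 * 3 * 2 * 4 = 48.
rational = [Fraction(-1, 2), Fraction(2, 3), Fraction(1, 2), Fraction(1, 4)]
integral = integer_weights(rational)
print(integral, truth_table(rational) == truth_table(integral))
```

Using the least common multiple of the denominators instead of their product would give smaller integers; the product is used here because it is the construction named in the proof.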


2.4 Learning a single-neuron binary Perceptron with the 
reinforcement rule 


Although a single-neuron binary Perceptron is not of great practical use, because
only a few of the logical functions are linear threshold functions with respect to the
variables $x_1,x_2,\ldots,x_n$, we can nevertheless gain much understanding of more
complicated networks by investigating the learning behaviour of one building unit.

In Section 2.2 we demonstrated how to determine the weights of a single-neuron 
binary Perceptron from a set of inequalities. Now we investigate how we can adapt 
step by step in a learning process the weights of a neuron in order to identify some 
logical function. In fact we can use the learning process as an algorithm to solve a 
set of inequalities. We assume that the function to be realized by the Perceptron is 
a linear threshold function. 

At a given step of the learning process we have some extended weight vector
$\hat w=[w_0,w_1,\ldots,w_n]$; the output will be correct for a subset of all arguments of the
function to be identified, and for the remaining arguments the output will be wrong.
The set of arguments for which the target output is equal to 1, whereas the actual
output is equal to 0, will be denoted by $T_w^+$; the set of arguments for which the
target output is equal to 0 and the actual output is 1 will be denoted by $T_w^-$. The
arguments $x_1,x_2,\ldots,x_n$ of the function y will be extended with the constant internal
input $x_0=1$ of the neuron. It will turn out that we have to change the weights
in proportion to the elements of $T_w^+$ and in negative proportion to the elements of $T_w^-$.


Example 2.9 


The function y to be realized is specified in Table 2.8. Note that in Table 2.8 the
extended inputs $[x_0,x_1,x_2]$ with $x_0=1$ are given, whereas y is a logical
function of $x_1$ and $x_2$. The neuron in the initial learning state (i.e. at step k=0) has
the weight vector $\hat w(0)=[w_0(0),w_1(0),w_2(0)]^t=[0.5,1,-1]^t$ (see Figure 2.13). For the
output $y(0)$ of the neuron at step k=0 we have (see Table 2.8) the following:

    $y(0)=S[w_0(0)+w_1(0)x_1+w_2(0)x_2]=S(0.5+x_1-x_2)$

Table 2.8

    x0  x1  x2  y   y(0)
    1   0   0   0   1
    1   0   1   0   0
    1   1   0   0   1
    1   1   1   1   1

Figure 2.13 Single neuron with initial weights

We see that for the extended input vectors $[1,0,0]^t$ and $[1,1,0]^t$ the output is
wrong. The only way to improve the output for the vector $[1,0,0]^t$ is by decreasing
the value of $w_0$. So we can change the weight vector $\hat w=[w_0,w_1,w_2]^t$ by an amount
$\Delta\hat w$ proportional to $-1[1,0,0]^t$, i.e. minus the first extended input vector. To improve
the output for the vector $[1,1,0]^t$ we have to decrease the values of $w_0$ and $w_1$. So
we can change the weight vector by an amount $\Delta\hat w$ proportional to $-1[1,1,0]^t$, i.e.
minus the third extended input vector.

We can take the proportionality constant equal to 1 and add both increments $\Delta\hat w$. Thus we
obtain $\Delta\hat w=[-2,-1,0]^t$. The new weight vector becomes
$\hat w(1)=[w_0(1),w_1(1),w_2(1)]^t=[-1.5,0,-1]^t$. Because:

    $y(1)=S[w_0(1)+w_1(1)x_1+w_2(1)x_2]=S(-1.5-x_2)$

we observe that now we obtain the wrong output only for the input vector $[1,1,1]^t$.
To improve the output of the neuron for that argument we have to increase $w_0$, $w_1$
and $w_2$. Thus we can take $\Delta\hat w$ proportional to the corresponding misclassified vector
$[1,1,1]^t$. We can proceed in the same way as above and after a finite number of
steps we will obtain a correct response of the single-neuron Perceptron, if the original
function is a linear threshold function with respect to $x_1,x_2,\ldots,x_n$. In Table 2.9 the
results for six learning steps are given. The final linear threshold function is equal to:

    $y=S(-1.5+x_1+x_2)$
Let in general $X^+$ be the set of argument values with target output y=1 and $X^-$
the set of argument values with target output y=0. For a given value of the weight
vector $\hat w$, a subset $T_w^+$ of $X^+$ and a subset $T_w^-$ of $X^-$ are misclassified.

Table 2.9

    Step        w0     w1   w2    T+         T-
    0     w     0.5    1    -1    -          [1,0,0], [1,1,0]
          dw    -2     -1   0
    1     w     -1.5   0    -1    [1,1,1]    -
          dw    1      1    1
    2     w     -0.5   1    0     -          [1,1,0]
          dw    -1     -1   0
    3     w     -1.5   0    0     [1,1,1]    -
          dw    1      1    1
    4     w     -0.5   1    1     -          [1,1,0], [1,0,1]
          dw    -2     -1   -1
    5     w     -2.5   0    0     [1,1,1]    -
          dw    1      1    1
    6     w     -1.5   1    1     -          -
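
The learning steps of Table 2.9 can be replayed with a short program. The following sketch is illustrative; the function name `global_learning` and its parameters are assumptions. At every step it adds all misclassified vectors with target 1 to the weight vector and subtracts all misclassified vectors with target 0, with the proportionality constant set to 1.

```python
def step(z):
    """Step function S(z): 1 if z > 0, otherwise 0."""
    return 1 if z > 0 else 0

def global_learning(samples, w, rate=1.0, max_steps=100):
    """Repeat the update over all misclassified samples; return visited weight vectors."""
    w = list(w)
    history = [tuple(w)]
    for _ in range(max_steps):
        # Misclassified extended inputs with sign +1 (target 1) or -1 (target 0),
        # all evaluated with the weights that hold at the start of this step.
        errors = []
        for x, target in samples:
            out = step(sum(wi * xi for wi, xi in zip(w, x)))
            if out != target:
                errors.append((x, target - out))
        if not errors:
            break
        for x, sign in errors:
            w = [wi + rate * sign * xi for wi, xi in zip(w, x)]
        history.append(tuple(w))
    return history

# Extended inputs [x0, x1, x2] and targets y of Table 2.8.
TABLE_2_8 = [((1, 0, 0), 0), ((1, 0, 1), 0), ((1, 1, 0), 0), ((1, 1, 1), 1)]
history = global_learning(TABLE_2_8, (0.5, 1.0, -1.0))
for k, weights in enumerate(history):
    print(k, weights)
```

Starting from [0.5, 1, -1] the run visits the seven weight vectors listed in Table 2.9 and stops at [-1.5, 1, 1].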





We will show that for a convergent learning process we can modify the weight
vector $\hat w$ by adding $\varepsilon\sum\hat x_i$ summed over $\hat x_i\in T_w^+$, where $\varepsilon>0$ is a proportionality
constant, and subtracting from $\hat w$ the value of $\varepsilon\sum\hat x_j$ for $\hat x_j\in T_w^-$. Because at each step
we use the total set $T_w^+\cup T_w^-$ for modifying the weight vector we will call this way
of learning global learning. However, we can also modify at each step the weight
vector with only one element of $T_w^+\cup T_w^-$. This way of learning will be called local
learning. The learning rule is called reinforcement learning for both cases. The
proportionality constant $\varepsilon$ is called the learning rate.

We see that the output of a single-neuron binary Perceptron at a certain step k
depends on the actual input and on the weight vector $\hat w(k)$ at step k. We can consider
the value of the weight vector at step k as the state of the learning system. This
enables us to describe the single-neuron binary Perceptron within the framework of
the theory of finite sequential machines. A finite sequential machine is described by
a state transition function $\delta$ and an output function $\lambda$. Given the state $\hat w(k)$ and the
input $\hat x$, the next state is defined by $\hat w(k+1)=\delta(\hat w(k),\hat x)$ and the output at step k is
defined by $y(k)=\lambda(\hat w(k),\hat x)$. The behaviour of a sequential machine can be graphically
represented by a so-called state diagram. In a state diagram, states are represented
by circles and transitions between states by arrows pointing from the actual state to
the next state. Each arrow has a label indicating the supplied input and a label
corresponding to the output.

We can define the output function $\lambda$ of the sequential machine corresponding to a
single-neuron Perceptron as:

    $\lambda(\hat w,\hat x)=S(w_0+w_1x_1+w_2x_2+\cdots+w_nx_n)$

In the case of local learning and with the learning rate $\varepsilon$ equal to 1, we can define the
state transition function $\delta$ as:

    $\delta(\hat w,\hat x)=\hat w$            if $\lambda(\hat w,\hat x)$ is correct
    $\delta(\hat w,\hat x)=\hat w+\hat x$     if $\hat x\in T_w^+$
    $\delta(\hat w,\hat x)=\hat w-\hat x$     if $\hat x\in T_w^-$


Example 2.10


The function y to be realized is specified in Table 2.10. We use the extended input,
i.e. with the constant $x_0=1$. The function y to be identified is a logical function of
$x_1$ and $x_2$. The different input values will be denoted by $p_0$, $p_1$, $p_2$ and $p_3$. The
neuron in the initial learning state (i.e. at step k=0) has the weight vector
$\hat w(0)=[w_0(0),w_1(0),w_2(0)]^t=[1,1,1]^t$. The output is defined by:

    $y(0)=S(w_0(0)+w_1(0)x_1+w_2(0)x_2)=S(1+x_1+x_2)$

Table 2.10

    p    x0  x1  x2  y   y(0)
    p0   1   0   0   0   1
    p1   1   0   1   0   1
    p2   1   1   0   0   1
    p3   1   1   1   1   1

Only for input $p_3$ is the output correct, and so we remain in the same state only for
that input. For input $p_0\in T_w^-$ the output is 1 and the next state is $\hat w(0)-p_0=[0,1,1]^t$.
For input $p_1\in T_w^-$ the output is 1 and the next state is $\hat w(0)-p_1=[0,1,0]^t$. For input
$p_2\in T_w^-$ the output is 1 and the next state is $\hat w(0)-p_2=[0,0,1]^t$ (see Figure 2.14). We
can continue this process in the same way, with the final result as shown in Figure 2.15.
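
The output function and the state transition function of the sequential machine can be written down almost literally. The sketch below is an illustration; the names `output`, `transition` and `INPUTS` are assumptions. It computes the transitions that leave the initial state of Example 2.10.

```python
def step(z):
    """Step function S(z): 1 if z > 0, otherwise 0."""
    return 1 if z > 0 else 0

def output(w, x):
    """Output function lambda(w, x) = S(w . x) for extended vectors w and x."""
    return step(sum(wi * xi for wi, xi in zip(w, x)))

def transition(w, x, target):
    """State transition function delta for local learning with learning rate 1."""
    if output(w, x) == target:
        return tuple(w)                                  # correct: stay in this state
    if target == 1:
        return tuple(wi + xi for wi, xi in zip(w, x))    # x in T+: add x
    return tuple(wi - xi for wi, xi in zip(w, x))        # x in T-: subtract x

# Extended inputs and targets of Table 2.10.
INPUTS = {"p0": ((1, 0, 0), 0), "p1": ((1, 0, 1), 0),
          "p2": ((1, 1, 0), 0), "p3": ((1, 1, 1), 1)}

initial = (1, 1, 1)
next_states = {name: transition(initial, x, t) for name, (x, t) in INPUTS.items()}
print(next_states)
```

Applying `transition` repeatedly to every reachable state and every input would enumerate the complete state diagram.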


In order to find a weight vector for a correct realization of a logical function, we 
found intuitively that in the learning phase the weight vector must be increased or 
decreased with vectors proportional to the misclassified extended input vectors. 

We can see the same in a more formal way. In the case of local learning the
adaptation at step k becomes:

    $\hat w(k+1)=\hat w(k)+\varepsilon\hat x_i$   if $\hat x_i\in T_w^+$
and
    $\hat w(k+1)=\hat w(k)-\varepsilon\hat x_j$   if $\hat x_j\in T_w^-$

In the case $\hat x_i\in T_w^+$ we have prior to adaptation that the inner product $\hat w(k)\cdot\hat x_i\le 0$.
After adaptation we have $\hat w(k+1)\cdot\hat x_i=\hat w(k)\cdot\hat x_i+\varepsilon|\hat x_i|^2$. So we add a positive number
$\varepsilon|\hat x_i|^2$ to the old inner product, and the inner product is changed in the desired
direction.

In the case $\hat x_j\in T_w^-$ we have prior to adaptation that the inner product $\hat w(k)\cdot\hat x_j>0$.
After adaptation we have $\hat w(k+1)\cdot\hat x_j=\hat w(k)\cdot\hat x_j-\varepsilon|\hat x_j|^2$. So the inner product is again
changed in the desired direction.

Figure 2.14 The initial state diagram of Example 2.10

Figure 2.15 The complete state diagram of Example 2.10

In the case of global learning the adaptation at step k becomes:

    $\hat w(k+1)=\hat w(k)+\varepsilon(\sum\hat x_i-\sum\hat x_j)$   with $\hat x_i\in T_w^+$ and $\hat x_j\in T_w^-$

We can conceive $\sum\hat x_i-\sum\hat x_j$ as one misclassified correction vector $\hat c$ of $T_w^+$. We have
prior to adaptation that the inner product $\hat w(k)\cdot\hat c\le 0$. After adaptation we have
$\hat w(k+1)\cdot\hat c=\hat w(k)\cdot\hat c+\varepsilon|\hat c|^2$. Thus in the case of global learning the inner product is also
changed in the desired direction. We see that at each step the behaviour of
the single-neuron binary Perceptron is improved with respect to the misclassified
input vectors at that step. However, in general after an adaptation step the set of
misclassified vectors is changed, so we have to adapt the weight vector for that new
set of misclassified vectors.

One may wonder whether we can always be sure that we will finally arrive at a
state where the outputs are correct. It might be supposed, for instance, that we may
never finish updating the weight vector because the process can enter a loop of states.
However, we will show that when at a certain step a state (i.e. a weight vector) $\hat w(k)$
is obtained, we will never return to the same state if there exists a solution.


Theorem 2.4 


During the learning process a single-neuron binary Perceptron will never enter the 
same state more than once if there exists a solution space. 


Proof 


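Let $\hat w(k+n)$ be a state reached from $\hat w(k)$ after n additional adaptations; we
then have a sum of vectors added to $\hat w(k)$:

    $\varepsilon(\sum\hat x_i-\sum\hat x_j)$   with $\hat x_i\in T^+_{w(k+p)}$ and $\hat x_j\in T^-_{w(k+p)}$ with $p=0,1,\ldots,n-1$

Let $\hat s$ (a solution vector) be a weight vector for which a correct solution is obtained.
We consider the inner product of $\hat s$ and $\hat w(k+n)$:

    $\hat s\cdot\hat w(k+n)=\hat s\cdot\hat w(k)+\varepsilon(\sum\hat s\cdot\hat x_i-\sum\hat s\cdot\hat x_j)$

summed over $\hat x_i\in T^+_{w(k+p)}$ and $\hat x_j\in T^-_{w(k+p)}$ with $p=0,1,\ldots,n-1$.

For every $\hat x_i\in T^+_{w(k+p)}$ we have $\hat s\cdot\hat x_i>0$ and for every $\hat x_j\in T^-_{w(k+p)}$ we have $\hat s\cdot\hat x_j\le 0$.
Thus $\hat s\cdot\hat w(k+n)\neq\hat s\cdot\hat w(k)$ and hence $\hat w(k+n)\neq\hat w(k)$.

(It may happen that $T^+_{w(k+p)}$ is empty for all p and that $\hat s\cdot\hat x_j=0$, thus in that case
$\hat s\cdot\hat w(k+n)=\hat s\cdot\hat w(k)$. However, in that case we can take another solution vector $\hat s'$ from
the solution space such that $\hat s'\cdot\hat x_j\neq 0$.) QED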


Although we will never return to the same state during learning, this does not
guarantee that the learning process will stop because the number of states in the
space containing incorrect states is infinite. However, the quotient of the number of
correct states and the number of incorrect states is finite. The subspace S of the
extended weight space W containing all correct weight vectors is called the solution
space. The solution space also contains an infinite number of weight vectors, as
becomes clear from the following arguments.

If $\hat w$ is a solution vector, then the inner product $\hat w\cdot\hat x>0$ for all $\hat x\in X^+$ and $\hat w\cdot\hat x\le 0$
for all $\hat x\in X^-$. The same holds for $\beta\hat w$ with $\beta>0$, thus $\beta\hat w$ is also a solution vector.
If $\hat w_1$ and $\hat w_2$ are solution vectors, then one easily verifies that $\beta_1\hat w_1+\beta_2\hat w_2$ with $\beta_1$ and
$\beta_2>0$ is also a solution vector. A space having these properties is called a convex
cone. In Figure 2.16 we have given the solution space of Example 2.9.

Figure 2.16 The solution space of Example 2.9

The quotient of the 'volume' of the solution space S and the volume of its
complement W - S is finite; thus if we could select weight vectors randomly we
would have a finite probability of selecting a correct solution vector. In the learning
process we do better because we never select a weight vector that has been chosen
previously (Theorem 2.4), and moreover after each adaptation we move with the new
weight vector in the direction of the solution space, as should become clear from the
next example. In our previous discussion it should have been noticeable that we did
not build our theory explicitly on binary valued input vectors. It turns out that linear
threshold functions can also be functions of real-valued input vectors. In the next
example we will use real-valued inputs.


Example 2.11 


Assume we want to learn a simple threshold function with one-dimensional real-valued
input vectors such that the output equals 1 for $x_1=(1)$ and for $x_2=(1.5)$, whereas the
output must be 0 for the inputs $x_3=(0.25)$ and $x_4=(-0.5)$.

In Figure 2.17 we have given for the extended input space the solution space
for $X^+=\{\hat x_1,\hat x_2\}=\{[1,1]^t,[1,1.5]^t\}$ and $X^-=\{\hat x_3,\hat x_4\}=\{[1,0.25]^t,[1,-0.5]^t\}$.
Because the output for $\hat x_1$ must be 1, we must have for the solution weight vector $\hat s$:
$\hat s\cdot\hat x_1>0$, and thus $\hat s$ must be located to the right of line $l_1$. For input $\hat x_3$ the output
must be 0 and thus $\hat s\cdot\hat x_3\le 0$, and thus $\hat s$ must be located to the left of the line $l_3$ or
on the line $l_3$. We can do the same for the other inputs. The intersection of the
separate solution spaces gives the solution space, indicated by the shaded area in
Figure 2.17.

Figure 2.17 Construction of the solution space of Example 2.11

Figure 2.18 The construction of the sequence of weight vectors of Example 2.11

Let the initial weights be $[w_0(0),w_1(0)]=(2,1)$; then the initial separating hyperplane
H(0) is defined by $2x_0+x_1=0$ (see Figure 2.18). With this initial hyperplane
the extended inputs $\hat x_3$ and $\hat x_4$ give the wrong output. If we use the procedure for
global learning we have to subtract from $\hat w(0)$ the vector $\varepsilon(\hat x_3+\hat x_4)$. With $\varepsilon=1$ the
new weight vector becomes $\hat w(1)=(2,1)-(2,-0.25)=(0,1.25)$. We see that the weight
vector is changed in the direction of the solution space. The corresponding separating
hyperplane is now identical with the $x_0$-axis, and only for the input vector $\hat x_3=(1,0.25)$
do we obtain a wrong output because $w_0(1)x_0+w_1(1)x_1=0+(1.25)(0.25)>0$ and hence
y=1. Now we subtract from $\hat w(1)$ the input vector $\hat x_3=(1,0.25)$. The next weight
vector becomes $\hat w(2)=(-1,1)$ at the border of the solution space. The output for the
input vector $\hat x_1$ is now just wrong: y=0. Thus in the next step we add $\hat x_1=(1,1)$ to
the weight vector and we obtain $\hat w(3)=(0,2)$. Input vector $\hat x_3$ is now misclassified, so
we have to subtract from $\hat w(3)$ the vector $\hat x_3=(1,0.25)$; the next weight vector
$\hat w(4)=(-1,1.75)$ is in the solution space.
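
Membership of the solution space is a simple test on inner products, so the sequence of weight vectors in Example 2.11 can be checked one by one. The sketch below is illustrative; the names `in_solution_space`, `X_PLUS` and `X_MINUS` are assumptions.

```python
# Extended input vectors [x0, x1] of Example 2.11.
X_PLUS = [(1, 1.0), (1, 1.5)]      # target output 1
X_MINUS = [(1, 0.25), (1, -0.5)]   # target output 0

def dot(a, b):
    """Inner product of two vectors of equal length."""
    return sum(ai * bi for ai, bi in zip(a, b))

def in_solution_space(w):
    """True if w.x > 0 for all x in X+ and w.x <= 0 for all x in X-."""
    return (all(dot(w, x) > 0 for x in X_PLUS)
            and all(dot(w, x) <= 0 for x in X_MINUS))

# The weight vectors w(0), ..., w(4) obtained in the example.
sequence = [(2, 1), (0, 1.25), (-1, 1), (0, 2), (-1, 1.75)]
flags = [in_solution_space(w) for w in sequence]
print(flags)
```

Only the last vector passes the test. Multiplying it by any positive constant, for instance `in_solution_space((-2, 3.5))`, also yields a solution, in line with the convex-cone property of the solution space.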


The question arises of whether we will always enter, in a finite number of adaptation
steps, the solution space. One can prove that for any constant learning rate $\varepsilon$ (called
fixed increment learning) this will be the case, and even for a time-varying learning rate
like $\varepsilon(k)=1/k$ or $\varepsilon(k)=k$. These statements are consequences of the Perceptron
convergence theorem, which will be discussed in the next section.


2.5 The Perceptron convergence theorem 


This section mainly deals with the formal statement of the Perceptron convergence 
theorem and its proof. Because we have already outlined the theorem in the previous 
section in an informal way, this section can be omitted on a first reading of the book 
without loss of continuity. 

The Perceptron convergence theorem concerns the convergence of the learning
procedure to find, from samples of correct behaviour, the linear threshold function
$y = S(w_0 + w_1x_1 + w_2x_2 + \cdots + w_nx_n)$ realized by a single binary Perceptron, if the
function $y = f(x_1, x_2, \ldots, x_n)$ to be identified is a linear threshold function (the symbol
$S$ represents the step function).

In the previous section the variables $y$ and $x_i$ were binary valued; the Perceptron
convergence theorem is, however, also applicable if the variables $x_i$ are real valued.

The reinforcement learning rule is given by:

Let $\hat{w}(0) = (w_0(0), w_1(0), \ldots, w_n(0))$ be any initial weight vector.

Let $\hat{w}(k) = (w_0(k), w_1(k), \ldots, w_n(k))$ be the weight vector at step $k$.

Let $\varepsilon(k)$ be a variable learning rate.

Local learning

If at step $k$: $w_0(k) + w_1(k)x_{i1} + w_2(k)x_{i2} + \cdots + w_n(k)x_{in} \le 0$ and it is given that
$y = f(x_i) = 1$ for some input vector $x_i = (x_{i1}, x_{i2}, \ldots, x_{in})$, then change $\hat{w}(k)$ into:

$$\hat{w}(k+1) = \hat{w}(k) + \varepsilon(k)\hat{x}_i$$

If at step $k$: $w_0(k) + w_1(k)x_{i1} + w_2(k)x_{i2} + \cdots + w_n(k)x_{in} > 0$ and it is given that
$y = f(x_i) = 0$ for some input vector $x_i = (x_{i1}, x_{i2}, \ldots, x_{in})$, then change $\hat{w}(k)$ to:

$$\hat{w}(k+1) = \hat{w}(k) - \varepsilon(k)\hat{x}_i$$

Global learning

If $\hat{e}(k) = \sum_i \hat{x}_i - \sum_j \hat{x}_j$ with $\hat{x}_i \in T^+(k)$ and $\hat{x}_j \in T^-(k)$, with $T^+(k)$ the set of extended input
vectors with target value 1 and actual output of the neuron at step $k$ equal to 0, and
$T^-(k)$ the set of extended input vectors with target value 0 and actual output of the
neuron at step $k$ equal to 1, then change the weight vector into:

$$\hat{w}(k+1) = \hat{w}(k) + \varepsilon(k)\hat{e}(k)$$

The Perceptron convergence theorem states: in the case of local or global learning
the weight vector $\hat{w}(k)$ converges to a solution vector $\hat{s}$ if the samples are linearly
separable and if the following conditions on the learning rate $\varepsilon(k)$ are satisfied:

1. $\varepsilon(k) \ge 0$

2. $\lim_{m\to\infty} \sum_{k=1}^{m} \varepsilon(k) = \infty$

3. $\lim_{m\to\infty} \dfrac{\sum_{k=1}^{m} \varepsilon(k)^2}{\left(\sum_{k=1}^{m} \varepsilon(k)\right)^2} = 0$
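
As a numerical illustration of condition 3, the sketch below evaluates the ratio for a constant rate, for $1/k$ and for $k$, and shows it shrinking as $m$ grows. The schedule $2^{-k}$ is an extra example chosen here (it is not from the text): its sum stays bounded, so condition 2 fails, and its ratio settles near 1/3 instead of tending to 0. All names are illustrative.

```python
def condition3_ratio(eps, m):
    """Return sum(eps(k)^2) / (sum(eps(k)))^2 for k = 1..m."""
    values = [eps(k) for k in range(1, m + 1)]
    return sum(v * v for v in values) / sum(values) ** 2


schedules = {
    "constant": lambda k: 0.5,
    "1/k": lambda k: 1.0 / k,
    "k": lambda k: float(k),
    "2^-k": lambda k: 2.0 ** -k,  # added counterexample, violates condition 2
}

for name, eps in schedules.items():
    ratios = [condition3_ratio(eps, m) for m in (10, 100, 1000)]
    print(name, ["%.4f" % r for r in ratios])
```

A finite sum can of course only suggest the limiting behaviour; it does not prove it.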


The conditions imply that convergence occurs for any positive constant learning
rate $\varepsilon$ or if $\varepsilon(k) = 1/k$, or even if it increases like $\varepsilon(k) = k$. If the set of samples is not
linearly separable, then the ‘separating’ hyperplane defined by $\sum_{i=0}^{n} w_ix_i = 0$ will
‘oscillate’ between several positions if the learning rate is constant or increasing. The
same occurs when the data set contains contradicting samples, so we can formulate
the following practical statement:

Practical statement 2.1

A decreasing value of the learning rate $\varepsilon(k)$ is particularly advisable if the set of samples
may not be linearly separable or contains contradictions, because in that case the effect
of ‘disruptive’ samples will be reduced.

(Note: For a formal proof of the theorem it is required that during learning a sample only
counts as correctly classified if the inner product $\hat{w}\cdot\hat{x}$ is larger than $\delta$ for inputs with
target value 1 and smaller than $-\delta$ for inputs with target value 0, with $\delta$ a
certain small positive constant.)

We will not give a general proof of the theorem and restrict it to the case where
$\varepsilon(k) = 1/|\hat{e}(k)|$ for global learning.










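Proof

If the learning process converges, then there exists a solution vector $\hat{s}$ such
that after some finite time $m$: $\hat{w}(m) = \hat{s}$. If $\hat{s}$ is a solution vector, then the unit vector
$\hat{u} = \hat{s}/|\hat{s}|$ is also a solution vector. Thus $\hat{w}(m) = \lambda\hat{u}$ for some $\lambda > 0$. If $\hat{w}(m) = \lambda\hat{u}$, then
$\hat{u}\cdot\hat{w}(m)/|\hat{w}(m)| = 1$. At every step $k$ we have for the value of the cosine of the angle
between $\hat{u}$ and $\hat{w}(k)$: $\hat{u}\cdot\hat{w}(k)/|\hat{w}(k)| \le 1$. We will show that at each step $\hat{u}\cdot\hat{w}(k)/|\hat{w}(k)|$ will
increase with a positive amount and thus after a finite number of steps must become
equal to 1.

Consider $\hat{u}\cdot\hat{w}(k+1) = \hat{u}\cdot\{\hat{w}(k) + \varepsilon(k)\hat{e}(k)\}$. Because we take $\varepsilon(k) = 1/|\hat{e}(k)|$ we obtain
$\hat{u}\cdot\hat{w}(k+1) = \hat{u}\cdot\hat{w}(k) + \hat{u}\cdot\hat{e}(k)/|\hat{e}(k)|$. Because $\hat{e}(k) = \sum_i \hat{x}_i - \sum_j \hat{x}_j$ with $\hat{x}_i \in T^+(k)$ and $\hat{x}_j \in T^-(k)$
we have $\hat{u}\cdot\hat{e}(k) = \delta(k) > 0$. Let $\delta' = \min_k \delta(k)/|\hat{e}(k)|$. So at each step the inner product
$\hat{u}\cdot\hat{w}(k)$ is increased with at least the value $\delta'$, thus $\hat{u}\cdot\hat{w}(k) > k\delta'$. (Note that if we take
$\varepsilon(k) > 1/|\hat{e}(k)|$, then the increments at each step will be larger.)

Now consider $|\hat{w}(k+1)|^2 = |\hat{w}(k) + \varepsilon(k)\hat{e}(k)|^2 = |\hat{w}(k)|^2 + 2\hat{w}(k)\cdot\hat{e}(k)/|\hat{e}(k)| + 1$. Because
$\hat{w}(k)\cdot\hat{e}(k) \le 0$ we have $|\hat{w}(k+1)|^2 \le |\hat{w}(k)|^2 + 1$. This implies $|\hat{w}(k)|^2 \le |\hat{w}(0)|^2 + k$ and thus
$|\hat{w}(k)| \le (|\hat{w}(0)|^2 + k)^{1/2}$. The expression $\hat{u}\cdot\hat{w}(k)/|\hat{w}(k)|$ can thus be bounded from below:
$\hat{u}\cdot\hat{w}(k)/|\hat{w}(k)| > \delta' k/(|\hat{w}(0)|^2 + k)^{1/2}$. The right-hand side grows like $\sqrt{k}$, whereas the cosine
cannot exceed 1, so after a finite number of steps $k$ the value of
$\hat{u}\cdot\hat{w}(k)/|\hat{w}(k)|$ must become equal to 1 and thus a solution vector must be reached.

QED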


We have found that after a finite number of steps the reinforcement learning rule 
will provide a correct solution if certain conditions are satisfied. There are, however, 
other learning rules that will give correct solutions under certain conditions. These 
rules are also based on the gradient descent procedures for minimizing certain criterion 
functions, like cost functions (for misclassification), sometimes called error functions 
or energy functions. 

The criterion function $E(\hat{w})$ is a positive scalar function which is zero if $\hat{w}$ is equal
to a solution vector. With the gradient descent procedure we move through a sequence
of weight vectors $\hat{w}_0, \hat{w}_1, \hat{w}_2, \ldots$ such that $E(\hat{w}_0) > E(\hat{w}_1) > E(\hat{w}_2) > \cdots$ and finally end
up with $E(\hat{w}_m) = 0$. The procedure starts with some arbitrary weight vector $\hat{w}_0$, then
computes the gradient vector $\nabla E(\hat{w}_0)$. The next weight vector is obtained by moving
in the direction of the steepest descent, i.e. along the negative of the gradient vector
$\nabla E(\hat{w}_0)$.

An obvious choice for E(w) would be the number of misclassified input vectors for 
the weight vector w. But in that case E(w) would be piecewise constant and the 
gradient of E(w) is zero or undefined, so this would be a poor candidate for the 
criterion function. 

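The reinforcement learning rule used before is obtained if we minimize the
Perceptron criterion function, i.e.

$$E(\hat{w}) = \sum_{\hat{x} \in T_e} -\hat{w}\cdot\hat{x} \quad \text{with } T_e = T^+ \cup \{-T^-\}$$

Here $T_e$ is the set of misclassified extended input vectors, with the vectors of $T^-$ negated.
Because $\hat{w}\cdot\hat{x}$ is negative for each $\hat{x} \in T_e$, the criterion function $E(\hat{w})$ is always positive
if there are misclassified input vectors, and it will only be zero for a solution vector
$\hat{s}$. Because $\Delta E(\hat{w}) = \nabla\{E(\hat{w})\}\cdot\Delta\hat{w}$ we have the situation that $E(\hat{w})$ will reduce if $\Delta\hat{w}$ is
proportional to the negative value of the gradient vector $\nabla\{E(\hat{w})\}$. The $i$th component
of the gradient vector, $\partial E(\hat{w})/\partial w_i$, is equal to $\sum_{\hat{x} \in T_e} -x_i$, thus we obtain for $\Delta\hat{w}$:

$$\Delta\hat{w} = \varepsilon \sum_{\hat{x} \in T_e} \hat{x}$$

as we used in the reinforcement learning rule.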
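
The link between the Perceptron criterion function and the global learning step can be checked numerically. The sketch below is only an illustration: the sample set is assumed (the third input is inferred from the worked example in Section 2.4, not given explicitly), and the function names are hypothetical.

```python
# Assumed samples: (extended input, target); the third entry is an inference.
SAMPLES = [((1.0, 1.0), 1), ((1.0, 0.25), 0), ((1.0, -0.5), 0)]


def step(z):
    """Step function S with S(0) = 0."""
    return 1 if z > 0 else 0


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def misclassified_set(w, samples):
    """T_e: misclassified extended inputs, negated when the target is 0."""
    t_e = []
    for x, target in samples:
        if step(dot(w, x)) != target:
            sign = 1.0 if target == 1 else -1.0
            t_e.append(tuple(sign * xi for xi in x))
    return t_e


def criterion(w, samples):
    """Perceptron criterion E(w): sum over T_e of -(w . x)."""
    return sum(-dot(w, x) for x in misclassified_set(w, samples))


def gradient(w, samples):
    """Component i of the gradient is the sum over T_e of -x_i."""
    t_e = misclassified_set(w, samples)
    return tuple(-sum(x[i] for x in t_e) for i in range(len(w)))


w0 = (2.0, 1.0)
g = gradient(w0, SAMPLES)
w1 = tuple(wi - 1.0 * gi for wi, gi in zip(w0, g))  # one descent step with eps = 1
print(criterion(w0, SAMPLES), g, w1, criterion(w1, SAMPLES))
```

With these assumed samples one descent step from $(2, 1)$ lands on $(0, 1.25)$, the same vector the global learning rule produces, and the criterion value drops.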
Another criterion function could be, for example:

$$E(\hat{w}) = \sum_{\hat{x} \in T_e} (\hat{w}\cdot\hat{x})^2 \quad \text{with } T_e = T^+ \cup \{-T^-\}$$

Although this criterion function can be used, it turns out that the corresponding
learning rule is inferior to the reinforcement learning rule.


2.6 Performance of a two-layer binary Perceptron 


In the previous sections we found that a single-neuron binary Perceptron can realize
linear threshold functions with respect to the binary variables $x_1, x_2, \ldots, x_n$. However,
most logical functions $f: \{0,1\}^n \to \{0,1\}$ are not linear threshold functions with respect
to $x_1, x_2, \ldots, x_n$. In this section, we will show that any logical function can be realized
with a two-layer binary Perceptron with one neuron in the second layer.

First we define some concepts, using a terminology related to pattern recognition.

The argument values of logical functions will be called patterns, so we have a
pattern set: $P = \{0,1\}^n$; a pattern $p_i$ is an element of $P$, i.e. $p_i = [p_{i1}, p_{i2}, \ldots, p_{in}]$ with
$p_{ij} \in \{0,1\}$.

The intersection $r_k$ of a pattern $p_i \in \{0,1\}^n$ and a pattern $q_j \in \{0,1\}^n$ is defined by
$p_i \cap q_j = r_k$, with $r_{kl} = 1$ if $p_{il} = 1$ and $q_{jl} = 1$ and with $r_{kl} = 0$ otherwise.

The union $r_k$ of a pattern $p_i \in \{0,1\}^n$ and a pattern $q_j \in \{0,1\}^n$ is defined by $p_i \cup q_j = r_k$,
with $r_{kl} = 1$ if $p_{il} = 1$ or $q_{jl} = 1$ or both and with $r_{kl} = 0$ otherwise.

The order of a pattern $p_i$ is defined as the number of 1s occurring in $p_i$ and is
denoted by $|p_i|$.

In the previous section we treated $x_j$ as a binary variable; however, it is helpful to
consider $x_j$ as a function $x_j: P \to \{0,1\}$ defined by:

$$x_j(p_i) = p_{ij}$$

Because $x_j(p_i)$ depends on one component, or one pixel, from the pattern $p_i$ we will
call $x_j$ a pixel function.

We introduce a special logical function which acts as a mask by which we observe
a pattern. Let $q \in P$, then $x_q: P \to \{0,1\}$ is a so-called mask function defined by:

$$x_q(p_i) = 1 \text{ if } q \cap p_i = q \text{ and } x_q(p_i) = 0 \text{ otherwise}$$

Example 2.12

$x_{1100}(1110) = 1$ and $x_{1100}(1001) = 0$ ■


A mask function can be written as a product of pixel functions. Let $q \in P$, then

$$x_q(p_i) = \tilde{x}_1(p_i)\,\tilde{x}_2(p_i) \cdots \tilde{x}_n(p_i)$$

with

$$\tilde{x}_j(p_i) = \begin{cases} x_j(p_i) & \text{if } q_j = 1 \\ 1 & \text{if } q_j = 0 \end{cases}$$

We will call $q$ in $x_q(p_i)$ the mask of the mask function $x_q$.

Example 2.13

For $n = 3$: $x_{101} = x_1x_3$ ■

We define the substratum $S_{p_i}$ of a pattern $p_i$ as the set of patterns: $S_{p_i} = \{q_k \mid x_{q_k}(p_i) = 1\}$.

Example 2.14

$S_{0011} = \{0000, 0010, 0001, 0011\}$ ■

We define the cover $C_{p_i}$ of a pattern $p_i$ as the set of patterns: $C_{p_i} = \{q_k \mid x_{p_i}(q_k) = 1\}$.

Example 2.15

$C_{0011} = \{0011, 0111, 1011, 1111\}$ ■
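
The definitions of intersection, mask function, substratum and cover translate directly into a few lines of code. The sketch below represents patterns as strings of '0' and '1'; this representation and the function names are choices made here for illustration only.

```python
from itertools import product


def patterns(n):
    """All binary patterns of length n as strings, e.g. '0011'."""
    return ["".join(bits) for bits in product("01", repeat=n)]


def intersection(p, q):
    """Component-wise AND of two patterns."""
    return "".join("1" if a == "1" and b == "1" else "0" for a, b in zip(p, q))


def mask(q, p):
    """Mask function x_q(p): 1 if the intersection of q and p equals q, else 0."""
    return 1 if intersection(q, p) == q else 0


def substratum(p):
    """S_p: all masks q with x_q(p) = 1."""
    return {q for q in patterns(len(p)) if mask(q, p) == 1}


def cover(p):
    """C_p: all patterns q with x_p(q) = 1."""
    return {q for q in patterns(len(p)) if mask(p, q) == 1}


print(mask("1100", "1110"), mask("1100", "1001"))
print(sorted(substratum("0011")))
print(sorted(cover("0011")))
```

The substratum collects the patterns "below" $p$ (all ones of $q$ also occur in $p$), while the cover collects the patterns "above" $p$.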
If a logical function $y: \{0,1\}^n \to \{0,1\}$ is written as an arithmetical linear combination
of the mask functions:

$$y(p_i) = \sum_{q_j} w_{q_j} x_{q_j}(p_i) \quad \text{with } w_{q_j} \in \mathbb{R}$$

then we will call such a form an arithmetical conjunctive normal form.
In the case where the logical function is written as:

$$y(p_i) = S\Big(\sum_{q_j} w_{q_j} x_{q_j}(p_i)\Big) \quad \text{with } w_{q_j} \in \mathbb{R} \text{ and } S \text{ the step function}$$

then we call this form an indirect arithmetical conjunctive normal form.
Now we can state the following theorem:

Theorem 2.5

Any logical function $y: \{0,1\}^n \to \{0,1\}$ can be written in an arithmetical conjunctive
normal form:

$$y = \sum_{q_j} w_{q_j} x_{q_j}, \quad q_j \in \{0,1\}^n$$

with $w_{q_j}$ an integer such that for each pattern $p_i \in P$:

$$\sum_{q_j \in S_{p_i}} w_{q_j} = y(p_i) \quad \text{with } S_{p_i} \text{ the substratum of } p_i$$

Before proving the theorem we will give an example of the theorem.

Example 2.16

Table 2.11

x1  x2  y
0   0   0
1   0   1
0   1   1
1   1   0

The exclusive-or function defined by Table 2.11 can be written as:

$$y = w_{00}x_{00} + w_{10}x_{10} + w_{01}x_{01} + w_{11}x_{11}$$

It turns out that $w_{00} = 0$, $w_{10} = 1$, $w_{01} = 1$ and $w_{11} = -2$. Thus:

$$y = x_1 + x_2 - 2x_1x_2$$ ■
Proof of Theorem 2.5

We have to prove that for any logical function $y: \{0,1\}^n \to \{0,1\}$ we can find a unique
set of coefficients $w_{q_j}$ such that:

$$\sum_{q_j \in P} w_{q_j} x_{q_j}(p_i) = y(p_i) \quad \text{for each } p_i \in P = \{0,1\}^n$$

Because $x_{q_j}(p_i) = 1$ if $q_j \in S_{p_i}$ and 0 otherwise, we have for every $p_i$:

$$\sum_{q_j \in P} w_{q_j} x_{q_j}(p_i) = \sum_{q_j \in S_{p_i}} w_{q_j} x_{q_j}(p_i)$$

For $q_j \in S_{p_i}$ we have $x_{q_j}(p_i) = 1$, thus:

$$\sum_{q_j \in S_{p_i}} w_{q_j} x_{q_j}(p_i) = \sum_{q_j \in S_{p_i}} w_{q_j}$$

Thus if:

$$\sum_{q_j \in S_{p_i}} w_{q_j} = y(p_i) \quad \text{for each } p_i \in P$$

we obtain:

$$\sum_{q_j \in P} w_{q_j} x_{q_j}(p_i) = y(p_i)$$

Moreover we have for a total function $2^n$ patterns and thus $2^n$ independent linear
equations of the form:

$$\sum_{q_j \in S_{p_i}} w_{q_j} = y(p_i)$$

Because the number of coefficients $w_{q_j}$ is $2^n$, the solution is unique. QED
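
The equations in the proof can be solved one at a time if the patterns are visited in order of increasing number of ones: every other member of the substratum of $p_i$ has fewer ones than $p_i$, so its coefficient is already known, which is also why the equations are independent. The sketch below is one possible way to do this, with patterns as strings and hypothetical function names; it is checked here against the exclusive-or coefficients of Example 2.16.

```python
from itertools import product


def is_submask(q, p):
    """True if every 1 of q is also a 1 of p, i.e. q is in the substratum of p."""
    return all(b == "1" for a, b in zip(q, p) if a == "1")


def acnf_coefficients(y):
    """Solve sum_{q in S_p} w_q = y(p) for all p; y maps pattern strings to 0 or 1."""
    n = len(next(iter(y)))
    all_patterns = ["".join(bits) for bits in product("01", repeat=n)]
    w = {}
    # Visit patterns by increasing order; w then holds all proper submasks of p.
    for p in sorted(all_patterns, key=lambda s: s.count("1")):
        w[p] = y[p] - sum(w[q] for q in w if is_submask(q, p))
    return w


xor = {"00": 0, "10": 1, "01": 1, "11": 0}
print(acnf_coefficients(xor))
```

The result is a dictionary from masks to integer weights; masks with weight 0 simply do not appear in the arithmetical conjunctive normal form.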


Example 2.17

Given the function $y: \{0,1\}^3 \to \{0,1\}$ specified by Table 2.12, with the conditions for
$w_q$ given in the last column.

Table 2.12

      x1  x2  x3  y
p0    0   0   0   1 = w000
p1    0   0   1   1 = w000 + w001
p2    0   1   0   0 = w000 + w010
p3    0   1   1   0 = w000 + w001 + w010 + w011
p4    1   0   0   1 = w000 + w100
p5    1   0   1   0 = w000 + w001 + w100 + w101
p6    1   1   0   0 = w000 + w010 + w100 + w110
p7    1   1   1   1 = w000 + w001 + w010 + w011 + w100 + w101 + w110 + w111

The solution for the set of equations is:

w000 = 1     w100 = 0
w001 = 0     w101 = -1
w010 = -1    w110 = 0
w011 = 0     w111 = 2

Thus the function $y$ can be written as a linear combination of mask functions:

$$y = 1 - x_2 - x_1x_3 + 2x_1x_2x_3$$ ■

In Section 2.2 we defined a linear threshold function with respect to the binary
variables $x_1, x_2, \ldots, x_n$ as a function that can be written as $y = S(\sum_i w_ix_i - T)$, with $S$
the step function.
From Theorem 2.5 the next theorem follows immediately.

Theorem 2.6

Any logical function $y: \{0,1\}^n \to \{0,1\}$ is a linear threshold function with respect to
the set of binary mask functions, i.e.

$$y = \sum_{q_j} w_{q_j} x_{q_j}$$

or

$$y = S\Big(\sum_{q_j \neq 00\ldots0} w_{q_j} x_{q_j} - T\Big) \quad \text{with threshold } T = -w_{00\ldots0}$$

Example 2.18

The exclusive-or function of Example 2.16 is a linear threshold function with respect
to the binary mask functions $x_{00} = 1$, $x_{10} = x_1$, $x_{01} = x_2$ and $x_{11} = x_1x_2$:

$$y = S(x_1 + x_2 - 2x_1x_2 - T) \quad \text{with } T = -w_{00} = 0$$ ■


A central theorem is as follows:

Theorem 2.7

Any logical function $f: \{0,1\}^n \to \{0,1\}$ can be realized by a simple two-layer binary
Perceptron.

Proof

Any logical function $y: \{0,1\}^n \to \{0,1\}$ is a linear threshold function with respect to
the binary valued mask functions:

$$y = S\Big(\sum_{q_j \neq 00\ldots0} w_{q_j} x_{q_j} - T\Big) \quad \text{with threshold } T = -w_{00\ldots0}$$

Given the binary valued mask functions $x_q$, one single binary output neuron can
realize the linear threshold function with respect to the mask functions $x_q$. The
threshold of the second-layer neuron is equal to $-w_{00\ldots0}$ of the linear threshold
function $y$.

Any mask function $x_q = x_ix_j \cdots x_k$ is a linear threshold function with respect to
$x_i, x_j, \ldots, x_k$ because $x_q = S(x_i + x_j + \cdots + x_k - T)$ with $T$ equal to the number of
variables in the product $x_q = x_ix_j \cdots x_k$ minus 1, and thus it can be realized by a single
binary neuron in the first layer. Such a first-layer neuron has an input $x_i$ for every
$x_i$ occurring in the product $x_q$ with a corresponding weight $w_i = 1$. The output of
such a first-layer neuron equals 1 if $x_q(p_i) = 1$, and 0 otherwise.

All neurons realizing mask functions constitute the first layer of the Perceptron.
The output of a first-layer neuron realizing a mask function $x_q$ is multiplied by the
synaptic weight $w_q$ of the connection to the output neuron.

Thus the output of the second-layer neuron equals 1 if and only if:

$$\sum_{q \neq 00\ldots0} w_q x_q > -w_{00\ldots0}$$

QED
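
The construction in the proof can be sketched as follows: one threshold neuron per mask in the first layer (unit weights, threshold equal to the order of the mask minus 1) and a single output neuron with threshold $-w_{00\ldots0}$. The weights used below are those of Example 2.17 as they follow from Table 2.12; the dictionary representation and the function names are assumptions made for this illustration.

```python
def step(z):
    """Step function S with S(0) = 0."""
    return 1 if z > 0 else 0


def first_layer_neuron(q, p):
    """Neuron for mask q: unit weights on the ones of q, threshold |q| - 1."""
    total = sum(int(pi) for qi, pi in zip(q, p) if qi == "1")
    return step(total - (q.count("1") - 1))


def two_layer_perceptron(w, p):
    """w maps mask strings to weights; the all-zero mask supplies T = -w_00...0."""
    zero = "0" * len(p)
    threshold = -w.get(zero, 0)
    total = sum(wq * first_layer_neuron(q, p) for q, wq in w.items() if q != zero)
    return step(total - threshold)


# Non-zero coefficients of y = 1 - x2 - x1*x3 + 2*x1*x2*x3 (Example 2.17).
W_217 = {"000": 1, "010": -1, "101": -1, "111": 2}
for p in ("000", "001", "010", "011", "100", "101", "110", "111"):
    print(p, two_layer_perceptron(W_217, p))
```

Only masks with a non-zero weight need a first-layer neuron, so this network has three first-layer neurons.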


Figure 2.19 gives the two-layer binary Perceptron which realizes the logical function 
described in Example 2.17. Figure 2.20 gives an alternative realization of the same 
function. 

In Theorem 2.5 we found that any logical function can be written in an arithmetical 
conjunctive normal form. The proof of the theorem also revealed a method of how 
to find that arithmetical conjunctive normal form. There is, however, another 
method to find the arithmetical conjunctive normal form. If we have written a logical 
function as a Boolean function then we can convert it in a systematic way into an 
equivalent arithmetical function using the following rules: 


1. If $x_i$ is a Boolean function (a single variable), then $x_i$ is an equivalent arithmetical
function.

2. If $\bar{x}_i$ is a Boolean function (the inverse of $x_i$), then $1 - x_i$ is an equivalent arithmetical
function.

3. If the Boolean function $f_1$ is equivalent to the arithmetical function $f'_1$ and
the Boolean function $f_2$ is equivalent to the arithmetical function $f'_2$, then the
Boolean function $f_1 \wedge f_2$ is equivalent to the arithmetical function $f'_1 f'_2$.

4. If the Boolean function $f_1$ is equivalent to the arithmetical function $f'_1$ and the
Boolean function $f_2$ is equivalent to the arithmetical function $f'_2$, then the Boolean
function $f_1 \vee f_2$ (also written as $f_1 + f_2$) is equivalent to the arithmetical function
$S(f'_1 + f'_2)$ with $S$ the step function.

5. If the Boolean function $f_1$ is equivalent to the arithmetical function $f'_1$ and the
Boolean function $f_2$ is equivalent to the arithmetical function $f'_2$ and the Boolean
functions $f_1$ and $f_2$ are not true for the same arguments, then the Boolean function
$f_1 \vee f_2$ is equivalent to the arithmetical function $f'_1 + f'_2$.

Figure 2.19 Two-layer binary Perceptron illustrating Table 2.11

Figure 2.20 Alternative configuration of the binary Perceptron of
Figure 2.19

Table 2.13

x1  x2  x3  y
0   0   0   1
0   0   1   1
0   1   0   0
0   1   1   0
1   0   0   1
1   0   1   0
1   1   0   0
1   1   1   1

Example 2.19

Consider the logical function specified by Table 2.13. This function can be specified
by the Boolean function: $y = \bar{x}_1\bar{x}_2\bar{x}_3 + \bar{x}_1\bar{x}_2x_3 + x_1\bar{x}_2\bar{x}_3 + x_1x_2x_3$. Because at most
one of the four terms of $y$ can be true for any input, we can use rule (5) above and obtain the equivalent
arithmetical conjunctive normal form:

$$y = (1 - x_1)(1 - x_2)(1 - x_3) + (1 - x_1)(1 - x_2)x_3 + x_1(1 - x_2)(1 - x_3) + x_1x_2x_3$$

or

$$y = 1 - x_2 - x_1x_3 + 2x_1x_2x_3$$

as found before in Example 2.17. ■
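
A quick exhaustive check of this conversion: the sketch below evaluates the sum of the four products obtained with rules (2), (3) and (5) and the simplified polynomial on all eight inputs. It also shows rule (4) on two arguments that are both true, where a plain sum would give 2 and the step function is needed. The function names are illustrative.

```python
from itertools import product


def y_rule5(x1, x2, x3):
    """Sum of the four mutually exclusive products, with negation written as 1 - x."""
    return ((1 - x1) * (1 - x2) * (1 - x3)
            + (1 - x1) * (1 - x2) * x3
            + x1 * (1 - x2) * (1 - x3)
            + x1 * x2 * x3)


def y_normal_form(x1, x2, x3):
    """Arithmetical conjunctive normal form after eliminating the brackets."""
    return 1 - x2 - x1 * x3 + 2 * x1 * x2 * x3


def boolean_or_rule4(a, b):
    """Rule (4): OR of two 0/1 values via the step function S(a + b)."""
    return 1 if a + b > 0 else 0


for x in product((0, 1), repeat=3):
    print(x, y_rule5(*x), y_normal_form(*x))
print(boolean_or_rule4(1, 1))
```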


A logical function can also be considered as a characteristic function of some set $Q \subseteq P$.
A characteristic function of a set $Q \subseteq P$ is a logical function $f_Q$ such that $f_Q(p_i) = 1$ if
$p_i \in Q$ and $f_Q(p_i) = 0$ otherwise. As a consequence of the conversion rules stated above,
we have the next theorem. In Theorem 2.8 we will use the complement of $p_i$ denoted
by $\bar{p}_i$ (e.g. if $p_i = \langle 010 \rangle$ then $\bar{p}_i = \langle 101 \rangle$).

Theorem 2.8

A characteristic function $f_{p_i}: P \to \{0,1\}$ for the singleton $K = \{p_i\}$ can always be written
in the arithmetical conjunctive normal form:

$$f_{p_i} = \sum_{q_k \in C_{p_i}} (-1)^{|q_k \cap \bar{p}_i|}\, x_{q_k}$$

with $C_{p_i}$ the cover of pattern $p_i$ and $|q_k \cap \bar{p}_i|$ the order of $q_k \cap \bar{p}_i$ (i.e. the number of
ones occurring in $q_k \cap \bar{p}_i$).


Proof

Consider $x_j$ as a Boolean function, then any characteristic function $f_{p_i}$ of the singleton
$\{p_i\}$ can be written as a Boolean product $f_{p_i} = \tilde{x}_1\tilde{x}_2 \cdots \tilde{x}_n$ with $\tilde{x}_j = x_j$ if $p_{ij} = 1$ and
$\tilde{x}_j = \bar{x}_j$ (the negation of $x_j$) if $p_{ij} = 0$. (For example for $n = 3$ and pattern $p_i = \langle 101 \rangle$:
$f_{p_i} = x_1\bar{x}_2x_3$.)

The Boolean function $x_j$ is equivalent to the arithmetical pixel function $x_j$. The
Boolean function $\bar{x}_j$ is equivalent to the arithmetical function $(1 - x_j)$, thus if we
replace the Boolean variables in the Boolean product by the corresponding
arithmetical functions we obtain an equivalent arithmetical product. The elimination
of the brackets in this product (and some thinking) gives the desired result. QED

Example 2.20

Let $f_{p_i} = \bar{x}_1\bar{x}_2x_3x_4$ be the Boolean form of the characteristic function of the singleton
$\{p_i\} = \{0011\}$. The complement of $p_i$ is: $\bar{p}_i = 1100$. The cover of $p_i$ is: $C_{p_i} = \{0011, 1011,
0111, 1111\}$. Thus we can replace the Boolean function by the equivalent arithmetical
function:

$$f_{0011} = (-1)^{|0011 \cap 1100|} x_{0011} + (-1)^{|1011 \cap 1100|} x_{1011} + (-1)^{|0111 \cap 1100|} x_{0111} + (-1)^{|1111 \cap 1100|} x_{1111}$$

or simply:

$$f_{p_i} = x_3x_4 - x_1x_3x_4 - x_2x_3x_4 + x_1x_2x_3x_4$$

(note: $\bar{x}_1\bar{x}_2x_3x_4 = (1 - x_1)(1 - x_2)x_3x_4$) ■
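
The signed sum over the cover can be generated mechanically and checked against the definition of a characteristic function. The sketch below does this for the pattern 0011 of Example 2.20; patterns are strings and all function names are hypothetical.

```python
from itertools import product


def patterns(n):
    """All binary patterns of length n as strings."""
    return ["".join(bits) for bits in product("01", repeat=n)]


def mask(q, p):
    """Mask function x_q(p): 1 if every 1 of q is also a 1 of p."""
    return 1 if all(b == "1" for a, b in zip(q, p) if a == "1") else 0


def singleton_terms(p):
    """Map each q in the cover of p to the sign (-1)^|q AND complement(p)|."""
    terms = {}
    for q in patterns(len(p)):
        if mask(p, q):  # q belongs to the cover C_p
            extra_ones = sum(1 for a, b in zip(q, p) if a == "1" and b == "0")
            terms[q] = (-1) ** extra_ones
    return terms


def evaluate(terms, p):
    """Value of the signed sum of mask functions at pattern p."""
    return sum(sign * mask(q, p) for q, sign in terms.items())


terms = singleton_terms("0011")
print(terms)
print([r for r in patterns(4) if evaluate(terms, r) == 1])
```

The only pattern on which the sum evaluates to 1 is 0011 itself; on every other pattern the positive and negative terms cancel.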
Theorem 2.8 can be extended to sets of patterns as follows:

Theorem 2.9

A characteristic function $f_K: P \to \{0,1\}$ for a class of patterns $K \subseteq P$ can always be
written in the arithmetical conjunctive normal form:

$$f_K = \sum_{p_i \in K} f_{p_i}$$

with:

$$f_{p_i} = \sum_{q_k \in C_{p_i}} (-1)^{|q_k \cap \bar{p}_i|}\, x_{q_k}$$

Proof

If $f_S$ is the characteristic function of a set of patterns $S$ (e.g. a singleton $\{p_i\}$) and $f_{p_j}$
the characteristic function of the singleton $\{p_j\}$ with $p_j \notin S$, then one easily verifies that $f_S + f_{p_j}$
is the characteristic function of the set $S \cup \{p_j\}$. QED
Example 2.21

For $n = 2$ let $K = \{00, 10\}$. The characteristic function of $\{00\}$ is $1 - x_1 - x_2 + x_1x_2$
and the characteristic function of $\{10\}$ is $x_1 - x_1x_2$. The characteristic function of
$K = \{00, 10\}$ becomes $1 - x_2$. ■

A mask function $x_q$ is the characteristic function of a set of patterns $K_q$ defined by:

$$K_q = \{p \mid x_q(p) = 1\}$$

Because this definition of $K_q$ is identical with the definition of the cover $C_q$ of pattern
$q$, we have the following lemma:

Lemma 2.3

The mask function $x_q$ is the characteristic function of the cover $C_q$ of pattern $q$.

This lemma gives us a convenient set-theoretical interpretation of the arithmetical
conjunctive normal form of a logical function $y$, and a way to determine the arguments
for which $y = 1$. In a subsequent section we will see how it can also give us the means
to show that a two-layer binary Perceptron is able to generalize from samples of
desired behaviour.

Example 2.22

The arithmetical conjunctive normal form of the exclusive-or function is as follows:

$$y = x_1 + x_2 - 2x_1x_2$$

Thus $y = 1$ for the elements of the set specified by:

$$C_{10} + C_{01} - 2C_{11} = \{10, 11\} + \{01, 11\} - 2\cdot\{11\} = \{10, 01\}$$

In this example we used the operations of addition, subtraction and multiplication
of sets. These operators must be defined in a formal way. We will come back to it
in Section 2.8. ■


Because of Theorem 2.8 a characteristic function $f_{p_i}$ of a pattern $p_i$ can be written in
an arithmetical conjunctive normal form:

$$f_{p_i} = \sum_{q_k \in C_{p_i}} (-1)^{|q_k \cap \bar{p}_i|}\, x_{q_k}$$

and thus the class containing only pattern $p_i$ can also be written as an arithmetical
sum of the covers $C_{q_k}$ (the set of patterns covering pattern $q_k$) represented by the
mask functions in the arithmetical conjunctive normal form:

$$\{p_i\} = \sum_{q_k \in C_{p_i}} (-1)^{|q_k \cap \bar{p}_i|}\, C_{q_k}$$

Example 2.23

For $n = 4$ and pattern $p_i = 0011$ the characteristic function is as follows:

$$f_{p_i} = x_3x_4 - x_1x_3x_4 - x_2x_3x_4 + x_1x_2x_3x_4$$

Thus

$$\{p_i\} = C_{0011} - C_{1011} - C_{0111} + C_{1111}$$

Hence

$$\{p_i\} = \{0011\} = \{1111, 0111, 1011, 0011\} - \{1111, 1011\} - \{1111, 0111\} + \{1111\}$$ ■


2.7 The adaptive recruitment learning rule 


In the previous section we found that any logical function $y: \{0,1\}^n \to \{0,1\}$ can always
be written in the arithmetical conjunctive normal form, and that every arithmetical
conjunctive normal form can be realized by a two-layer binary Perceptron.

A two-layer neural network that realizes such a logical function by some linear
threshold function with respect to the mask functions $x_q$:

$$y = S\Big(\sum_{q} w_q x_q(p) - T\Big)$$

must have, for each $q \in \{0,1\}^n$ for which $w_q \neq 0$, an input neuron realizing $x_q$; moreover
the weight of the connections from the output of a first-layer neuron realizing $x_q$
to the input of the output neuron must be equal to $w_q$.

Assuming we do not have an explicit description of the threshold function but only
a finite set $D$, the training set, of examples consisting of pairs $\langle p_i, y(p_i) \rangle$, then we
investigate whether we can develop a learning rule leading to the recruitment of the
required first-layer neurons and correct values of the weights $w_q$ such that at the end
of the learning process the two-layer binary Perceptron will at least give the correct
response to all patterns of $D$.

A logical function $y: P \to \{0,1\}$ with $P = \{0,1\}^n$ can be considered as a pattern
classification function, i.e. $y(p_i) = 1$ if $p_i$ belongs to some subset $K$ of $P$ and $y(p_i) = 0$ if
$p_i$ belongs to the complement $\bar{K} = P - K$.

The subset of patterns in the given data set $D$ that are elements of $K$ will be called
the set of examples $E$, and the subset of patterns of $D$ that are elements of $\bar{K}$ will be
called the set of counterexamples $F$.

The learning rule we will present will give the correct response to the training set
$D$ after a finite learning time and will generalize in some sense on the basis of the
training set to other inputs not present in the training set. In contradistinction with
learning rules for other artificial neural nets, we have to present every element of the
training set $D$ just once. This implies that the learning time is proportional to the
number of elements in $D$. As far as we know the learning rule presented here is new
and has not been published before.

Assume we want to learn a logical function $y: \{0,1\}^n \to \{0,1\}$ and we have a finite
set of examples (set $E$) and counterexamples (set $F$).


The adaptive recruitment learning rule

1. Given initially an arbitrary two-layer binary Perceptron with in the first layer an
arbitrary number (it might be zero) of neurons realizing mask functions $x_q$ with
$q \in \{0,1\}^n$. The outputs $z_q$ of input neurons are multiplied by arbitrary weights $w_q$
and connected to one single output neuron with arbitrary valued threshold $T$.

2. Present all examples and counterexamples of the set $D = E \cup F$ in order of
increasing number of ‘ones’ occurring in the patterns.

3. If an example or counterexample is correctly classified, go to the next element of
the ordered set $D$.

4. If a pattern $p$ is presented and incorrectly classified and there exists no first-layer
neuron that realizes the mask function $x_p$, introduce such a neuron. Change the
weight $w_p$ to $w_p + \Delta$ with $\Delta$ such that the output of the output neuron becomes
correct. ($\Delta$ is positive if $p$ belongs to $E$ and the output was 0; $\Delta$ is negative if $p$
belongs to $F$ and the output was 1.)

5. Go to the next element of the training set $D$.

Before proving that after learning the set $D$ is correctly classified, we will give a
simple example.


Example 2.24

Assume we want to identify the logical function such that $y = 1$ for the elements of
$K = \{0100, 1001, 0101, 0110, 1101, 1011, 0111, 1111\}$ and thus $y = 0$ for the elements
of $\bar{K} = \{0000, 0001, 0010, 0011, 1000, 1010, 1100, 1110\}$.

Assume we do not know $K$ but only the set of examples: $E = \{0100, 1001\}$ and one
counterexample $F = \{1100\}$.

Assume we start the learning process with a neural net without any first-layer
neurons and only an output neuron with threshold $T = 0$.

We start the learning process with the example 0100. The output is incorrect so
we have to introduce a first-layer neuron that realizes the mask function $x_{0100}$. For
the weight we will obtain $w_{0100} = 1$ (see Figure 2.21).

In the next learning step we take the counterexample: 1100. We observe that for
the neural net obtained after the first step the output for input 1100 is wrong: $y = 1$.
We have to introduce a second first-layer neuron that realizes the mask function
$x_{1100}$ with a weight $w_{1100} = -1$ (see Figure 2.22).

Figure 2.22 The neural net after the second learning step

In the third step we take example 1001. For the neural net of Figure 2.22 we obtain
for the input 1001 the output $y = 0$. Thus we have to add an additional first-layer
neuron that realizes the mask function $x_{1001}$ with a weight $w_{1001} = 1$ (see
Figure 2.23). Because we have presented all examples and counterexamples, we are
at the end of the learning process. One easily verifies that now the output $y$ of the final
neural net is equal to 1 for all elements of $K$ and $y = 0$ for all elements of $\bar{K}$.

At a first glance one might be surprised that in the previous example we could
identify the logical function $y$ just with two examples and one counterexample. But
the example was not fair because the unknown function could as well have been
defined as $y = 1$ for the set:

$$K' = E \cup (\bar{K} - F) = \{0100, 1001\} \cup \{0000, 0001, 0010, 0011, 1000, 1010, 1100, 1110\} - \{1100\}$$

and $y = 0$ for the set:

$$\bar{K}' = F \cup (K - E) = \{1100\} \cup \{0100, 1001, 0101, 0110, 1101, 1011, 0111, 1111\} - \{0100, 1001\}$$

Learning with the same sets of examples $E$ and counterexamples $F$ would result
in the same neural net but with a wrong response for all inputs except for the elements
of $E$ and $F$. ■
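
The learning steps of this example can be reproduced with a short sketch of the adaptive recruitment rule. Two details are assumptions made here, not prescriptions of the rule: $\Delta$ is taken as the integer of smallest magnitude that corrects the output, and patterns with the same number of ones are presented in the order in which they are listed. Patterns are strings and all names are illustrative.

```python
import math


def mask(q, p):
    """Mask function x_q(p): 1 if every 1 of q is also a 1 of p."""
    return 1 if all(b == "1" for a, b in zip(q, p) if a == "1") else 0


def weighted_sum(weights, p):
    return sum(w * mask(q, p) for q, w in weights.items())


def output(weights, threshold, p):
    """Output neuron: S(sum_q w_q x_q(p) - T), with S(z) = 1 only for z > 0."""
    return 1 if weighted_sum(weights, p) - threshold > 0 else 0


def adaptive_recruitment(examples, counterexamples, weights=None, threshold=0):
    """Return the mask weights after one pass over D = E union F."""
    weights = dict(weights or {})
    targets = {p: 1 for p in examples}
    targets.update({p: 0 for p in counterexamples})
    # Rule 2: present patterns in order of increasing number of ones.
    for p in sorted(targets, key=lambda s: s.count("1")):
        if output(weights, threshold, p) == targets[p]:
            continue  # rule 3: already classified correctly
        total = weighted_sum(weights, p)
        if targets[p] == 1:
            delta = math.floor(threshold - total) + 1  # smallest integer with total + delta > T
        else:
            delta = -math.ceil(total - threshold)      # largest integer with total + delta <= T
        weights[p] = weights.get(p, 0) + delta         # rule 4: recruit x_p if it is missing
    return weights


net = adaptive_recruitment(["0100", "1001"], ["1100"])
print(net)
```

With the sets $E$ and $F$ of the example this yields the three first-layer neurons described above, i.e. the net realizes $y = S(x_2 - x_1x_2 + x_1x_4)$.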


Although in an ideal learning situation one wishes to generalize from a restricted
set of examples and counterexamples, the previous example gives ground to the
following general hypothesis:

Generalization by learning from examples and counterexamples is in general impossible
without utilizing a priori knowledge about the properties of the function to be identified.

Figure 2.23 The neural net after the third learning step

We will come back to this subject later. We will first present a proof of correctness
of the adaptive recruitment learning rule.


Proof of correctness of the adaptive recruitment learning rule

We have to prove that after learning, the set $E \cup F$ of examples $E$ and counterexamples
$F$ is correctly classified. This implies that if $E = K$ and $F = \bar{K}$ we can identify any
logical function exactly.

Let $R(k)$ be a subset of $D$ which is correctly classified after step $k$. Assume we
present at step $k+1$ an element $p_j \in E$. After step $k+1$ the linear threshold function
realized by the neural net will have the form:

$$y(k+1) = S\Big(\sum_{q \in Q,\, q \neq p_j} w_q x_q + w_{p_j} x_{p_j} - T\Big) \quad \text{for some set } Q$$

Due to the ordered presentation of examples and counterexamples during the learning
process, we have for every $p_i \in R(k)$ that the number of 1s occurring in $p_i$ is smaller
than the number of 1s occurring in $p_j$, or if that number is the same, then $p_i \neq p_j$, and
thus $x_{p_j}(p_i) = 0$ for all $p_i \in R(k)$. Thus $y(k+1) = y(k)$ for all $p_i \in R(k)$ and hence
$R(k) \subseteq R(k+1)$. After step $k+1$ the weight $w_{p_j}$ will be such that $p_j$ is correctly classified
and thus $R(k+1) = R(k) \cup \{p_j\}$. The same reasoning holds if we present at step $k+1$
a counterexample. So we finally will end up with the situation that $D$ is correctly
classified. QED

If the initial net contains no neurons in the first layer, and all initial weights are
zero, and the initial threshold of the output neuron is zero, and if we take for $\Delta$ in
rule 4 the smallest integer satisfying the condition mentioned, then we will call the
applied learning rule the proper adaptive recruitment learning rule.

It may be worthwhile noting that the linear threshold function realized by the
adaptive recruitment rule may contain fewer terms than the arithmetical conjunctive
normal form obtained by the procedure mentioned in the proof of Theorem 2.5.


Example 2.25

For the binary ‘or’ function we will find with Theorem 2.5 the following conjunctive
normal form:

$$y = x_1 + x_2 - x_1x_2$$

The linear threshold function realized by the proper adaptive recruitment learning
rule will be:

$$y = S(x_1 + x_2)$$ ■


2.8 Generalizing with a two-layer binary Perceptron 


An ideal learning performance in pattern classification would be when correct
classification of the patterns of a class $K$ occurs after a learning phase in which the
patterns of a finite proper subset of $K$ and patterns of a finite proper subset of the
complement of $K$ (the counterexamples) are presented to the learning system. This
will frequently occur with the two-layer binary Perceptron, as in Example 2.24, but
in general we cannot guarantee that the obtained classification for the set $K$ is correct
unless $E = K$ and $F = \bar{K}$.

When, however, some a priori knowledge about the relation between the set of
examples $E$, the set of counterexamples $F$, and the class $K$ of patterns to be identified
can be taken into account, then correct classification can in general be learned from
a proper subset of $K$ and a proper subset of $\bar{K}$.

We investigate the properties of the class $L$ for which the output $y$ of the binary
Perceptron will be 1 after the learning phase. In order to formulate a theorem relating
to the properties of class $L$ we have to introduce some new concepts.

In regular set theory a set $A$ is defined as a collection of distinguishable objects
$a_i$, each object occurring once in the set. In the subsequent discussion we need sets
in which an object $a_i$ can occur $\alpha_0 + \alpha_i$ times, with $\alpha_0$ and $\alpha_i \in \mathbb{R}$. For this purpose we
define an extensive set $\hat{A}$ of a regular set $A$: $\hat{A} = \{\alpha_0, \alpha_1a_1, \alpha_2a_2, \ldots, \alpha_na_n\}$ with the
following properties:

If $\hat{A} = \{\alpha_0, \alpha_1a_1, \alpha_2a_2, \ldots, \alpha_na_n\}$ and $\gamma \in \mathbb{R}$, then $\gamma\hat{A} = \{\gamma\alpha_0, \gamma\alpha_1a_1, \gamma\alpha_2a_2, \ldots, \gamma\alpha_na_n\}$.

If $\hat{A} = \{\alpha_0, \alpha_1a_1, \alpha_2a_2, \ldots, \alpha_na_n\}$ and $\hat{B} = \{\beta_0, \beta_1b_1, \beta_2b_2, \ldots, \beta_mb_m\}$, then $\hat{A} + \hat{B} =
\{\alpha_0 + \beta_0, \alpha_1a_1, \alpha_2a_2, \ldots, \alpha_na_n, \beta_1b_1, \beta_2b_2, \ldots, \beta_mb_m\}$. Note that if $a_i = b_j = z$, then
$(\alpha_i + \beta_j)z \in \hat{A} + \hat{B}$.

We can convert an extensive set into a regular set with the set step function $S$
defined as:

$$S(\hat{A}) = S\{\alpha_0, \alpha_1a_1, \alpha_2a_2, \ldots, \alpha_na_n\} = \{a_i \mid \alpha_0 + \alpha_i > 0\}$$

We can now give a theorem concerning the set of patterns accepted by the binary 
Perceptron after learning. 


Theorem 2.10 


If E is the set of examples and F the sct of counterexamples and we use the adaptive 
recruitment learning rule, then after learning, the output of the Perceptron will be 
equal to t for elements of some set L and will be zero for the set P—L, with L: 


L=5( È Ca È aC) 

qEE' QEF 

with S the set step function, C, the cover of example q; and C, the cover of 
counterexample qj, with a,>0 and «;>0 and E’ a subset of E and F” a subset of F. 


Before presenting the proof of Theorem 2.10 we will illustrate this theorem with 
two examples. 


Example 2.26 


Let E = {010} and F = {110}, then after learning the Perceptron will realize the
following linear threshold function:

$$y = S(x_2 - x_1 x_2)$$

According to the theorem we obtain for the set L:

$$L = S(C_{010} - C_{110}) = S(\{010, 011, 110, 111\} - \{110, 111\}) = \{010, 011\}$$ ■

Example 2.27


Let for n = 3 the set of patterns be: K = {⟨010⟩, ⟨011⟩}. Assume we start learning
with an initial neural network containing in the first layer a neuron that realizes the
mask function $x_{010}$ and that it is connected to the output neuron with a weight
$w_{010} = 2$. Let the set of examples be E = {⟨011⟩} and the set of counterexamples
F = {⟨110⟩, ⟨111⟩}. After learning we obtain a neural net realizing the linear threshold
function:

$$y = S(2x_2 - 2x_1 x_2 + x_2 x_3 - x_1 x_2 x_3)$$

For the class L we obtain:

$$L = S(2C_{010} - 2C_{110} + C_{011} - C_{111})$$
$$= S(2\{010, 011, 110, 111\} - 2\{110, 111\} + \{011, 111\} - \{111\})$$
$$= S(2\{010\}, 3\{011\})$$
$$= \{010, 011\}$$ ■
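The two examples can be checked mechanically. The sketch below assumes that the cover $C_q$ of a pattern q consists of all patterns that have a 1 wherever q has a 1, and it represents patterns as bit strings; the helper names `cover` and `accepted_set` are hypothetical.

```python
from itertools import product


def cover(q):
    """All bit strings p of the same length with p_i = 1 wherever q_i = 1."""
    return {
        "".join(p)
        for p in product("01", repeat=len(q))
        if all(pb == "1" for pb, qb in zip(p, q) if qb == "1")
    }


def accepted_set(weighted_patterns, threshold=0):
    """Apply the set step function to a weighted sum of covers.

    weighted_patterns maps a pattern q to its signed weight: positive for an
    example, negative for a counterexample.
    """
    counts = {}
    for q, alpha in weighted_patterns.items():
        for p in cover(q):
            counts[p] = counts.get(p, 0) + alpha
    return {p for p, c in counts.items() if c - threshold > 0}


example_2_26 = accepted_set({"010": 1, "110": -1})
example_2_27 = accepted_set({"010": 2, "110": -2, "011": 1, "111": -1})
```

Both calls return the set containing 010 and 011, in agreement with the class L derived in Examples 2.26 and 2.27.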


Proof of Theorem 2.10 


By inspection of the adaptive recruitment learning rule we see that for each example
$q_i \in E$, a first-layer neuron that realizes the mask function $x_{q_i}$ with a corresponding
weight $w_{q_i} > 0$ will be introduced if the example does not already give the correct
response. If the example has already been accepted, no first-layer neuron will be
introduced and $w_{q_i} = 0$. The same holds for a counterexample $q_j$, but now with $w_{q_j} < 0$
or $w_{q_j} = 0$. The output neuron realizes a step function with some threshold T. Thus
after learning the Perceptron will realize a linear threshold function:

$$y(p) = S\left(\sum_{q_i \in E'} w_{q_i} x_{q_i}(p) + \sum_{q_j \in F'} w_{q_j} x_{q_j}(p) - T\right) \quad \text{with } E' \subseteq E \text{ and } F' \subseteq F$$

A mask function $x_q$ will have the value 1 for all elements of the cover $C_q$. Thus for
a pattern p accepted (y = 1) by the Perceptron we have:

$$S\left(\sum_{q_i \in E',\ p \in C_{q_i}} w_{q_i} + \sum_{q_j \in F',\ p \in C_{q_j}} w_{q_j} - T\right) = 1$$

For a pattern p rejected (y = 0) by the Perceptron we have:

$$S\left(\sum_{q_i \in E',\ p \in C_{q_i}} w_{q_i} + \sum_{q_j \in F',\ p \in C_{q_j}} w_{q_j} - T\right) = 0$$

Thus if pattern p is a member of some cover $C_{q_i}$ (respectively $C_{q_j}$),
then we can equivalently count that pattern $w_{q_i}$ times positively (respectively $-w_{q_j}$ times
negatively), collect these contributions in one total number $\beta$ and subtract from $\beta$
the threshold T. Now we can say that pattern p is $\beta - T$ times a member of the
extended set $\hat{C}$. In the case that p is accepted we must have that $\beta - T > 0$ because
$S(\beta - T) > 0$ and thus $p \in S(\hat{C})$. In case that pattern p is not accepted we obtain
similarly $\beta - T \le 0$. Thus with $\alpha_i = w_{q_i}$ and $\alpha_j = -w_{q_j}$ we obtain:

$$L = S\left(\sum_{q_i \in E'} \alpha_i C_{q_i} - \sum_{q_j \in F'} \alpha_j C_{q_j} - T\right)$$ QED

If the set L is identical with the class K that has to be identified we say that K is
coverable by the set of examples and counterexamples.

We conclude that with the adaptive recruitment learning rule the Perceptron can
generalize correctly from sets of examples and counterexamples. However, in
general it might be hard to determine beforehand whether the (unknown) class K to
be identified will be coverable by the given sets of examples E and counterexamples F.

Theorem 2.10 implies that any logical function can be written as:

$$y = S\left(\sum_{q \in E} w_q x_q + \sum_{q \in F} w_q x_q\right) \quad \text{with } y(p) = 1 \text{ iff } p \in K$$

with $w_q \ge 0$ if pattern $q \in E \subseteq K$
and $w_q \le 0$ if pattern $q \in F \subseteq \bar{K}$

This statement can also be proven without relying on the adaptive recruitment
learning rule, so we have the following theorem:


Theorem 2.11 


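For every class K of patterns there exists a linear threshold function:

$$y = S\left(\sum_{q \in E} w_q x_q + \sum_{q \in F} w_q x_q\right) \quad \text{with } y(p) = 1 \text{ iff } p \in K$$

with $w_q > 0$ if pattern $q \in E \subseteq K$
and $w_q < 0$ if pattern $q \in F \subseteq \bar{K}$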


Proof 


Let

$$f' = S\left(\sum_{q} w_q x_q\right)$$

be some linear threshold function realizing the classification of class K. Assume $w_{q_i} \le 0$,
whereas pattern $q_i \in K$. There exists, however, a characteristic function for pattern $q_i$
of the form (Theorem 2.8):

$$f_{q_i} = \sum_{q_j \in C_{q_i}} (-1)^{|q_j| - |q_i|} x_{q_j}$$

Now $x_{q_i}$ is a term occurring in the sum of $f_{q_i}$ with weight 1. The characteristic
function $f_{q_i}$ of $q_i$ can be added to f' without changing the classification performed
by f' (Theorem 2.3). The same holds for $\delta \cdot f_{q_i}$ with $\delta \ge 0$.

We can always select $\delta$ such that the new synaptic weight of $x_{q_i}$ in f', $w'_{q_i} = w_{q_i} + \delta$,
becomes zero or positive. If $w'_{q_i}$ becomes zero, then the mask $x_{q_i}$ can be eliminated
from f' and hence $q_i \in K$ does not occur in E.

Similarly, if $w_{q_j} \ge 0$ in f' whereas pattern $q_j \in \bar{K}$, we can change the synaptic weight
to a negative or zero value $w'_{q_j} = w_{q_j} + \delta$ with $\delta \le 0$ without changing the classification
function. If $w'_{q_j}$ becomes zero, then the mask $x_{q_j}$ can be eliminated from f' and hence
$q_j \in \bar{K}$ does not occur in F.

If we change the weights in such an order that $w_{q_i}$ is altered before $w_{q_j}$ if $|q_i| < |q_j|$,
then we avoid the alteration of weights changed before. QED


Example 2.28 


Let for n = 3 the class of patterns be K = {⟨010⟩, ⟨011⟩}. A threshold function realizing
the classification of class K is as follows:

$$y = S(2x_2 - 2x_1 x_2 - x_2 x_3 + x_1 x_2 x_3)$$

Note that the weight of $x_2 x_3 = x_{011}$ is negative, whereas ⟨011⟩ ∈ K.

The characteristic function for the pattern 011 is $f = x_2 x_3 - x_1 x_2 x_3$. We can add
this function to the argument of the step function in y without changing the
classification function y. We obtain:

$$y = S(2x_2 - 2x_1 x_2)$$ ■
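A brute-force check over all eight patterns confirms that adding the characteristic function leaves the classification unchanged. The sketch assumes the threshold function of the example reads $y = S(2x_2 - 2x_1x_2 - x_2x_3 + x_1x_2x_3)$ and that S(s) = 1 only for s > 0; the helper names are hypothetical.

```python
from itertools import product


def mask(q, p):
    """Mask function x_q(p): 1 if p has a 1 wherever q has a 1, else 0."""
    return int(all(pb == 1 for pb, qb in zip(p, q) if qb == 1))


def threshold_fn(weights):
    """Build y(p) = S(sum_q w_q x_q(p)) with S(s) = 1 iff s > 0."""
    def y(p):
        return int(sum(w * mask(q, p) for q, w in weights.items()) > 0)
    return y


# Weights keyed by the pattern q of the mask x_q, e.g. (0, 1, 1) stands for x2*x3.
before = threshold_fn({(0, 1, 0): 2, (1, 1, 0): -2, (0, 1, 1): -1, (1, 1, 1): 1})
after = threshold_fn({(0, 1, 0): 2, (1, 1, 0): -2})

patterns = list(product((0, 1), repeat=3))
accepted_before = [p for p in patterns if before(p)]
accepted_after = [p for p in patterns if after(p)]
```

Both lists contain exactly the patterns 010 and 011, so the simplified function still realizes the class K.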


2.9 The recruitment and reinforcement learning rule 


The adaptive recruitment learning rule discussed in Section 2.7 is very fast because 
we have to present all elements of the set of examples E and all elements of the set 
F of counterexamples just once, and we certainly obtain, after learning, the correct 
response for elements of E and F. A disadvantage of the adaptive recruitment learning 
rule is that we have to order the learning set D = E ∪ F according to the increasing
number of 1s occurring in the binary vectors.

We can, however, also train a two-layer binary Perceptron with the reinforcement
learning rule introduced in Section 2.4. In that case we do not have to present the
samples in some fixed order, but on the other hand we must apply the whole set
of samples many times during learning. The fact that we can use the reinforcement
rule for training a two-layer binary Perceptron will become clear if we realize that
any logical function $y: \{0, 1\}^n \to \{0, 1\}$ is a linear threshold function with respect to
the set of all mask functions $x_q$ (see Theorem 2.6). However, if we want to train a
two-layer binary Perceptron with the reinforcement learning rule we must have in
the first layer a neuron for every mask function $x_q$ with $q \in \{0, 1\}^n$. For any realistic
application n will be large and will thus require a tremendous number of first-layer
neurons (for n = 10 the number of first-layer neurons will be 1024).

After learning with the reinforcement learning rule with a first-layer neuron for
each mask function $x_q$, it turns out that a great number of first-layer neurons can
be removed because the corresponding weight $w_q$ will be zero.

We can do better by not introducing first-layer neurons if that is not necessary.
This principle is used in the next learning rule. The outputs of the first-layer neurons
will be represented by the variables $z_1$, $z_2$, etc.


The recruitment and reinforcement learning rule 


Step 0 There is no first-layer neuron and there is one output neuron with threshold
zero. (The output neuron has a constant threshold input $z_0 = 1$ connected
to some initial weight $w_0$.)

Step k Take randomly an element q of the training set E ∪ F. If the output y(k)
of the output neuron is incorrect and there exists no input neuron that realizes
the mask function $x_q$ then introduce such a neuron first. If there exists an
input neuron that realizes the mask function $x_q$ and the output y(k) is
incorrect, then change the extended weight vector $\hat{w}(k)$, composed of the
ordered set of all weights $w_i$ including the threshold weight $w_0$, to:

$$\hat{w}(k + 1) = \hat{w}(k) + \varepsilon(k)\hat{z}(k) \quad \text{if } y(k) = 0 \text{ whereas } q \in E$$
$$\hat{w}(k + 1) = \hat{w}(k) - \varepsilon(k)\hat{z}(k) \quad \text{if } y(k) = 1 \text{ whereas } q \in F$$

with $\varepsilon(k)$ the learning rate (see Section 2.4) and $\hat{z}(k)$ the extended vector
composed of $z_0$ and the ordered set of all outputs of the first-layer neurons
introduced so far.
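The rule can be sketched in Python as follows. This is not the authors' implementation but a minimal illustration under stated assumptions: the training set is presented cyclically in a fixed order instead of randomly (so that the run is reproducible), the learning rate is constant, the output neuron fires only for a strictly positive sum, and the names `mask`, `output` and `train` are hypothetical.

```python
def mask(q, x):
    """Mask function x_q: 1 if x has a 1 wherever q has a 1, else 0."""
    return int(all(xb == 1 for xb, qb in zip(x, q) if qb == 1))


def output(masks, w, x):
    """Return the net output y and the extended first-layer vector z."""
    z = [1] + [mask(q, x) for q in masks]  # z[0] = 1 is the threshold input
    y = int(sum(wi * zi for wi, zi in zip(w, z)) > 0)
    return y, z


def train(examples, counterexamples, rate=1.0, max_sweeps=1000):
    data = [(x, 1) for x in examples] + [(x, 0) for x in counterexamples]
    masks = []  # patterns q whose mask neuron has been recruited
    w = [0.0]   # w[0] is the threshold weight
    for _ in range(max_sweeps):
        errors = 0
        for x, target in data:
            y, z = output(masks, w, x)
            if y == target:
                continue
            errors += 1
            if x not in masks:      # recruitment: add the neuron for x_q
                masks.append(x)
                w.append(0.0)
                z.append(1)         # x_q equals 1 on the pattern q itself
            sign = 1.0 if target == 1 else -1.0
            w = [wi + sign * rate * zi for wi, zi in zip(w, z)]  # reinforcement
        if errors == 0:
            return masks, w
    raise RuntimeError("training set not classified within max_sweeps")


# Exclusive-or: examples give output 1, counterexamples give output 0.
masks, w = train(examples=[(0, 1), (1, 0)], counterexamples=[(0, 0), (1, 1)])
predictions = {x: output(masks, w, x)[0] for x in [(0, 0), (0, 1), (1, 0), (1, 1)]}
```

With this presentation order the sketch classifies the exclusive-or training set correctly after a few sweeps. A different order, or a random one as in the rule itself, may recruit the neurons in another sequence and end with other weights.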


As stated before we do not have to order the learning set D when we use the recruitment
and reinforcement learning rule, but we now have to present the whole set D many
times in order to reach the situation in which the set D is correctly classified.

We can use local and global learning, and the learning rate $\varepsilon(k)$ may be fixed or
time varying, as discussed in Section 2.4. The correctness of the recruitment and
reinforcement rule is based on the correctness of the reinforcement rule discussed in
Sections 2.4 and 2.5. The difference is the recruitment during learning of first-layer
neurons that realize mask functions. Every time we introduce a new first-layer neuron
we can conceive that configuration as a new initial situation for learning a linear
threshold function with the reinforcement learning rule. The Perceptron convergence
theorem (see Section 2.5) does not, however, depend on the initial situation; we only
have to guarantee that during learning at least all mask functions required for
identification of the linear threshold function are realized by neurons in the first layer.
If any of the required mask functions are not realized by the neurons in the first
layer, then the output y cannot be correct, but in that case we introduce the missing
neuron in the first layer with the recruitment and reinforcement rule. It might be
that we introduce a redundant number of first-layer neurons but many of them will
be eliminated during learning because the corresponding weights to the output neuron
will become zero.


Figure 2.24 Initial configuration of Example 2.29


Figure 2.25 The neural net after the first learning step


Example 2.29 


Suppose we want to learn with the recruitment and reinforcement learning rule the
exclusive-or function, with E = {[01], [10]} and F = {[00], [11]}. We start with the
neural net of Figure 2.24 and take the learning rate equal to 1. We apply example
[10]. The output y = 0 is not correct. We have to introduce a first-layer neuron that
realizes the mask function $x_{10}$. The output of the neural net for example [10] is now
still incorrect because the introduced neuron is not connected to the output neuron:
the weight $w_1 = 0$. For the input [10] we obtain for the output of the introduced
neuron: $z_1 = 1$ and thus $\hat{z}^t = [1, 1]$. The new weight vector becomes $\hat{w}(2) = \hat{w}(1) + \hat{z}(1) =
[0, 0]^t + [1, 1]^t = [1, 1]^t$. The new neural net is given in Figure 2.25. Next we apply a
counterexample: [11]. The output y = 1 of the net of Figure 2.25 is not correct. We
have to introduce a new neuron that realizes the mask function $x_{11}$. With this
additional neuron the weight vector becomes $\hat{w}(2) = [1, 1, 0]^t$. The first-layer extended
output vector becomes $\hat{z}(2) = [1, 1, 1]^t$. The output y = 1 of the net is incorrect, so we
have to subtract from $\hat{w}(2)$ the vector $\hat{z}(2) = [1, 1, 1]^t$. We obtain $\hat{w}(3) = [0, 0, -1]^t$.


The obtained neural net is given in Figure 2.26. When we proceed in the same way
and apply the sequence of inputs [10], [00], [01], [00], [11], [10], [01], [00] we will
finally find the network of Figure 2.27 that realizes the exclusive-or function. ■


Figure 2.27 The final neural network showing the exclusive-or function


Practical statement 2.2


In some applications it might occur that the training set contains contradictions. If
we apply in that case the recruitment and reinforcement rule with a decreasing value of
the learning rate $\varepsilon(k)$, the influence of those disruptive elements will be reduced. In
that case the output of the final net will become equal to the target values of the
elements in the training set that occur with the highest frequency.


2.10 Application of the adaptive recruitment learning rule
to switching circuits


As stated before, a binary Perceptron (BPC) can and will realize any logical function,
and thus any combinatorial digital circuit, provided that enough examples and
counterexamples are given to the net. This behaviour is quite useful if one needs to
design a complex switching circuit. Using a BPC can simplify matters considerably.
Let us take a look at the next example.

Imagine one has to design a combinatorial circuit that realizes the function specified
in Table 2.14 (a dash indicates a 'don't care'). This is a function with six inputs
($x_1 \ldots x_6$). The total number of input vectors is sixty-four ($= 2^6$).


Table 2.14

Input vector            Input vector
x1 ... x6    Output     x1 ... x6    Output

000000       1          110001       1
000001       1          110010       1
000010       1          110100       1
010000       1          000111       0
001000       0          011010       -
000100       0          111000       -
100000       1          101100       0
000011       1          001101       0
001010       1          100101       0
010001       1          001111       -
010010       1          010111       1
010100       1          011101       -
100001       1          011110       1
100010       1          101011       1
101000       1          101101       -
110000       1          110011       1
000101       0          110101       1
001001       0          110110       1
000110       0          111001       1
001100       0          111100       1
100100       0          100111       0
011000       0          011011       -
001011       1          101110       -
001110       -          111010       -
010011       1          011111       1
010101       1          101111       1
010110       1          110111       1
011001       -          111011       1
011100       0          111101       1
100011       1          111110       1
101001       1          111111       1
101010       1          100110       0


According to the adaptive recruitment learning rule described in Section 2.7 one
has to train the net starting with the zero input vector and then input vectors
containing only one 1. Then continue learning using the input vectors containing
two 1s, and so on. This assures correct learning.


Figure 2.28 The neural network obtained after twenty-two steps illustrating Table 2.14


Table 2.15 (input vectors $x_1 \ldots x_6$ in the same order as in Table 2.14)
Table 2.16

x_q          w_q

x_000000      1
x_001000     -1
x_000100     -1
x_001010      1
x_010100      1
x_101000      1
x_011010     -1
x_111000     -1
x_001110      1
x_011001      1
x_011100      1
x_011011     -1
x_101110     -2
Nrori0        1
Xiron         1
Xiro          1


The net was trained with all vectors containing at most two Is. The neural net 
obtained after learning with the twenty-two input vectors is given in Figure 2.28. The 
output y of the neural net turns out to be consistent with the specification given above.
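The count of twenty-two training vectors follows from the ordering required by the adaptive recruitment rule: one zero vector, six vectors with a single 1 and fifteen vectors with two 1s. A short sketch, with the hypothetical helper name `ordered_inputs`, generates the vectors in that order:

```python
from itertools import product


def ordered_inputs(n, max_ones=None):
    """Binary vectors of length n, sorted by increasing number of 1s.

    If max_ones is given, vectors with more 1s are left out.
    """
    vectors = ["".join(bits) for bits in product("01", repeat=n)]
    if max_ones is not None:
        vectors = [v for v in vectors if v.count("1") <= max_ones]
    return sorted(vectors, key=lambda v: v.count("1"))


training_vectors = ordered_inputs(6, max_ones=2)
all_vectors = ordered_inputs(6)
```

The list `training_vectors` has 1 + 6 + 15 = 22 entries and starts with 000000, while `all_vectors` holds all sixty-four input vectors in an order that the adaptive recruitment rule accepts.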

Although there exists a set of just six sample input vectors that will also yield the
same neural net after learning, we may say we are just lucky that we found the correct
neural network with twenty-two sample input vectors. If the specification of the
logical function to be identified is changed on the positions of the 'don't cares' to
the opposite value of the corresponding output of the neural net obtained before (see
Table 2.15), we obtain the same neural net with the same twenty-two learning input
vectors, but the output will be wrong for the input vectors with 'don't cares' in the
original table. (Wrongly classified input vectors are marked in Table 2.15 with a '*'.)

The minimal Boolean expression for the logical function specified by Table 2.15 is: 


VEX XX Ns AN NING + XQX 5X6 HX1X3X6 +X3X4 +X2X4 


This expression was obtained using the McCluskey minimization algorithm.

In order to guarantee that the neural net will realize some logical function one
has to apply all input vectors. If we apply the sixty-four input vectors of Table 2.15
we obtain a neural net as specified in Table 2.16. The first column gives the set of
mask functions realized by the sixteen neurons in the first layer. The second column
gives the weights connecting the corresponding input neurons to the output neuron.

We see that only sixteen first-layer neurons are required to realize the combinatorial
function.


2.11 Application of the adaptive recruitment learning rule
to hyphenation


During the last decade several software programs have been developed to break
words with a hyphen. These programs are in general complicated because they are
mostly based on grammatical rules and exceptions to these rules. Furthermore
computers are notoriously bad at hyphenation. In the early days of automatic
hyphenation, errors such as 'the-rapists who pre-ached on wee-knights' could occur.
It takes considerable effort and time to develop and implement correct hyphenation
algorithms. If we use a neural network, however, we can leave the job of 'developing
and implementing the hyphenation algorithm' to a learning process on the basis of
examples of hyphenations.

We will show that we can use a binary Perceptron to learn correct hyphenation. 
In our experiment we used the 70000 words out of a dictionary of hyphenated words 
of a Dutch minority language, Frisian. 

Because the input of a binary network requires binary vectors we had to code the
language alphabet. The code can be arbitrarily chosen but it is profitable to use a
code with few '1s' for alphabet symbols that occur frequently in the language and
to code 'phonetically similar' symbols with a similar binary code. We used the code
shown in Table 2.17.

One method is to present to the binary neural net a binary coded word and require
that the target be an output vector of the same length as the uncoded input word
with only a '1' on the position of an input symbol after which the word may be
broken. In that case one has to limit the length of the words. Moreover the number
of inputs and outputs is large.

We took a different approach. We look through a window with a length of seven
symbols which moves from left to right across the word. We observe through the
window a word-segment of seven symbols. If the hyphen must be placed after the
third symbol we require the target output of the neural net to be '1', and '0' otherwise.
Table 2.18 illustrates this method. The word 'Perceptron' is given and with a window
of five symbols we move across the word from left to right. We observe successively
ten different word-segments. The correct response to each word-segment is given in
the last column.


Table 2.17

Symbol  Code        Symbol  Code

a       10000000    z       00101110
u       10001000    h       00100100
y       10001100    j       00100110
e       10000100    v       00100111
o       10000110    l       00010000
i       10000111    c       00011000
b       01000000    w       00011100
k       01001000    q       00011110
t       01001100    x       00011111
d       01000100    m       00010100
p       01000110    n       00010110
g       00100000    r       00010111
f       00101000    -       00000000
s       00101100


Table 2.18

Word: --PER-CEP-TRON--

No.   Word-segment   Target

1     --PER          0
2     -PERC          0
3     PERCE          1
4     ERCEP          0
5     RCEPT          0
6     CEPTR          1
7     EPTRO          0
8     PTRON          0
9     TRON-          0
10    RON--          0
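The moving-window construction of Table 2.18 can be sketched as below. The sketch assumes that a hyphenated word is given as a string with '-' at the break points, that the word is padded with '-' so that every letter appears once as the symbol after which a break is tested, and that the function name `window_segments` and its parameters are hypothetical.

```python
def window_segments(hyphenated, width=5, position=3, pad="-"):
    """Return (segment, target) pairs for a hyphenated word.

    The target is 1 if a hyphen may be placed after the symbol at the
    given position (counted from 1) inside the window, else 0.
    """
    parts = hyphenated.split("-")
    plain = "".join(parts)
    breaks = set()
    letters_seen = 0
    for part in parts[:-1]:
        letters_seen += len(part)
        breaks.add(letters_seen)  # a hyphen may follow this many letters
    padded = pad * (position - 1) + plain + pad * (width - position)
    pairs = []
    for start in range(len(plain)):
        segment = padded[start:start + width]
        target = 1 if (start + 1) in breaks else 0
        pairs.append((segment, target))
    return pairs


segments = window_segments("PER-CEP-TRON")
```

For the word of Table 2.18 this gives ten segments, of which only PERCE and CEPTR carry the target 1. To use the segments as input for the binary Perceptron, each symbol would still have to be replaced by its eight-bit code from Table 2.17.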

From the 70000 words in the dictionary we obtained approximately 250000
different word-segments. It can occur that the same word-segment requires a response
'1' as well as the response '0', e.g. the word-segments '-record' and 'record-' from the
word 'record'. The word 'record' is supposed to be broken as 'rec-ord' when it is a
noun, but as 're-cord' when it is a verb. We eliminated all those contradictory examples
from the obtained list of word-segments and then made from them one special lookup
table of exceptions. It turned out that 1039 of the 250000 examples were contradictory.
The list of non-contradictory binary coded word-segments with their target values
represents the training set D of a binary function $f: \{0, 1\}^{7 \times 8} \to \{0, 1\}$ that can be
learned via the adaptive recruitment rule as discussed in Section 2.7. According to
our proof in Section 2.7, the learning set D will be classified 100 per cent correctly
after learning.

After learning, we obtained a binary neural net of approximately 20000 neurons in the
first layer and one output neuron. In a hardware realization of the obtained neural network,
computation time would be negligible because of the parallel structure of the neural
net, but in a simulation of the neural network on an ordinary PC it requires several
seconds to compute the position of the hyphens in a word. For any realistic application
this time is too long. A simple solution to this problem is to divide the learning set
D into a large number of disjoint subsets $D_i$ and learn a separate binary Perceptron
for each subset. If after learning we have a simple rule to decide to which subset $D_i$
a word-segment belongs, we only have to consider the output of the neural net trained
with subset $D_i$. In an experiment we divided the learning set D with 250000 elements
into 27 × 27 = 729 subsets such that all word-segments with the same third and fourth
symbol were placed in the same subset. We trained 729 different binary Perceptrons
separately with those training sets. After learning, every subnetwork i will correctly
classify the elements of the corresponding learning set $D_i$. The total number of
first-layer neurons after learning was 32522. The mean number of neurons in one
subnetwork is 32522/729 ≈ 45. With an ordinary PC the time to compute the correct
positions of the hyphens in a word is reduced in this way to a few milliseconds.

A drawback of the adaptive recruitment rule is the requirement to order the learning
set D according to the increasing number of '1s' in the coded elements. A second
disadvantage is the requirement that the learning set may not contain contradictions.
The recruitment and reinforcement learning rule does not have these disadvantages
but will yield a larger neural network and it requires a very long learning time, as
may become clear from the next application.


2.12 Application of the recruitment and reinforcement learning 
rule to contradictory binary data sets 


According to the Practical Statement 2.2 the learning set is allowed to contain 
contradictions if we use the recruitment and reinforcement learning rule as discussed 
in Section 2.9. If there exists a contradiction in the data set, the final neural net will 
respond with the target value of the most frequent target for the same input if the 
learning rate is decreasing with time. In an experiment we used the learning set of 
Table 2.19. The input vector [010101] occurs twice with target ‘1’ and once with 
target ‘0’. 

First we performed an experiment using recruitment and reinforcement learning
with no contradictions in the data set (we took the target for input 101010 equal
to 1). We took a value of 0.5 for the external learning rate $\varepsilon$. Learning was stopped
at the moment that all inputs had been correctly classified. After learning, the first
layer of the neural network contained at least forty and at most forty-nine neurons
depending on the sequence of applied examples in ten experiments. (With the adaptive
recruitment rule we found in Section 2.10 a neural network with only sixteen first-layer
neurons for the same problem.) If we use another constant value of the learning rate
(e.g. $\varepsilon = 100$), the same network is obtained if the sequence of applied examples is the
same; only the absolute value of the weights is changed in proportion to the change in
the value of the learning rate $\varepsilon$. The number of wrongly classified examples that must
be applied before correct behaviour was obtained was approximately 126. (With the
adaptive recruitment rule sixty-four examples must be applied.)

If we introduce a contradiction in the data set, as mentioned above, and use a
linearly decreasing value of the learning rate $\varepsilon(t)$ from 1 to 0.01 in 150 learning steps
and constant 0.01 after that, a correct neural network (with output 1 for input 010101)
was obtained with the mean of forty-five neurons (in ten experiments) in the first
layer. The number of wrongly classified examples that must be applied during learning
varied between 110 and 170, depending on the order of applied examples.
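The learning-rate schedule of this experiment can be written as a small function. The sketch assumes that learning steps are counted from 0, that the rate equals 1 at step 0 and reaches 0.01 at step 150; the function name and parameter names are hypothetical.

```python
def learning_rate(step, start=1.0, end=0.01, decay_steps=150):
    """Linearly decreasing learning rate, constant after decay_steps."""
    if step >= decay_steps:
        return end
    return start + (end - start) * step / decay_steps


schedule = [learning_rate(t) for t in (0, 75, 150, 1000)]
```

Halfway through the decay (step 75) the rate is about 0.505, and from step 150 onward it stays at 0.01, so later contradictory samples can shift the weights only slightly.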


Table 2.19

Input vector            Input vector
x1 ... x6    Output     x1 ... x6    Output

000000       1          110001       1
000001       1          110010       1
000010       1          110100       1
010000       1          000111       0
001000       0          011010       0
000100       0          111000       0
100000       1          101100       0
000011       1          001101       0
001010       1          100101       0
010001       1          001111       1
010010       1          010111       1
010100       1          011101       1
100001       1          011110       1
100010       1          101011       1
101000       1          101101       1
110000       1          110011       1
000101       0          110101       1
001001       0          110110       1
000110       0          111001       1
001100       0          111100       1
100100       0          100111       0
011000       0          011011       0
001011       1          101110       0
001110       1          111010       0
010011       1          011111       1
010101 (2x)  1          101111       1
010101 (1x)  0
010110       1          110111       1
011001       1          111011       1
011100       1          111101       1
100011       1          111110       1
101001       1          111111       1
101010       1          100110       0


1. Show that Table 2.2 cannot be realized with a single-neuron binary Perceptron.

2. Show with the use of the consistency property mentioned in Section 2.2 that
Table 2.2 cannot be realized with a single-neuron binary Perceptron.

3. Apply the adaptive recruitment learning rule to the data set of Table 2.2.

4. Apply the adaptive recruitment learning rule to the data set of the exclusive-or
function.

5. Apply the adaptive recruitment learning rule to the data set of Table 2.2.

6. Explain why the weight vector w points in the direction of the positive side of
the separating hyperplane defined by: $w^t x = 0$.

7. Show with the help of Lemma 2.1 that the linear threshold function
$y = S(x_1 + x_2 - x_3)$ is equivalent to $y = S(x_1 + x_2 - x_3 - 0.5)$.

8. Apply the reinforcement learning rule to the data set given in the following table.
The initial weights of the single-neuron binary Perceptron are: $w_0 = 1$, $w_1 = 1$ and
$w_2 = -1$.

x1   x2   y

0    0    0
0    1    1
1    0    1
1    1    1

9. Let $x_q(p_i)$ be a mask function. Determine the value of $x_q(p_i)$ for q = 001100 and
$p_1$ = 011110 and $p_2$ = 000100.

10. Determine with two different methods the arithmetical conjunctive normal form
of the logical function specified in the following table.





x 
11. Apply the proper adaptive recruitment learning rule to the data set of the table
of Exercise 10.

12. Let the set of examples be E = {001, 010} and the set of counterexamples be
F = {011}. Apply the proper adaptive recruitment rule. Show that the output of
the obtained neural net will be equal to 1 for the set L and will be 0 for the set P - L,
with $P = \{0, 1\}^3$ and L:

$$L = S\left(\sum_{q_i \in E} \alpha_i C_{q_i} - \sum_{q_j \in F} \alpha_j C_{q_j}\right) \quad \text{with S the set step function}$$

with $C_{q_i}$ the cover of example $q_i$ and $C_{q_j}$ the cover of counterexample $q_j$.

13. Apply the recruitment and reinforcement learning rule to the data set of the table
of Exercise 10.
THE CONTINUOUS MULTI-LAYER
PERCEPTRON


3.1 Introduction 


A continuous multi-layer Perceptron is used in a situation where one presupposes
the existence of an unknown functional relationship $g: X \to Y$ between two sets X
and Y of data with $X \subset \mathbb{R}^n$ and $Y \subset \mathbb{R}^m$, whereas for a given finite data set
$D = \{[x_i, t(x_i)] \mid x_i \in X,\ t(x_i) \in Y,\ 1 \le i \le N\}$ it is assumed that $t(x_i) = g(x_i)$. In that case one
wishes to identify or to approximate the unknown function g on the basis of the set
of samples from D.

A multi-layer Perceptron can realize an infinite set of functions $g_w: \mathbb{R}^n \to \mathbb{R}^m$
depending on the vector w, composed of all weights in the neural net.

Given a learning set L that is a finite subset of the data set D, and a so-called
learning rule one can change the weight vector w of the neural net such that $[x_i, g_w(x_i)]$
becomes equal or approximately equal to $[x_i, t(x_i)]$ for each pair in L. In the learning
phase the weight vector w is changed step by step by presenting the set L to the
adaptation algorithm at each step k. After learning, the performance of the net is
tested with the test set T = D - L (see Figure 3.1).

After the learning phase the neural network will also yield for any input $x_i$ not in
D some output $g_w(x_i)$. If it is assumed that the a priori unknown function g is
approximated by $g_w$ if for all $[x_i, t(x_i)] \in D$ the value $t(x_i)$ is approximated 'well enough'
by $g_w(x_i)$, then we call this phenomenon generalization.

The multi-layer Perceptron consists of one, two or three layers of neurons. A layer
may have an arbitrary number of neurons. In Figure 3.2 a two-layer Perceptron is
given with two neurons in the first layer and one neuron in the second layer. Each
neuron has a number of inputs and one output. The output of a neuron in some
layer constitutes an input for each neuron in the next layer. The output value of a
neuron is some function f of the weighted sum $s(x) = \sum_i w_i x_i + w_0$ of all its input
values $x_i$. The scalars $w_i$ are called the weights of the neuron and $-w_0$ is called the
threshold of the neuron. The transfer function f is usually the same for all neurons
in the neural net. The most common transfer function is the so-called sigmoid function:
$f[s(x)] = [1 + \exp(-s(x))]^{-1}$ (see Figure 3.3).
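A single neuron with the sigmoid transfer function can be sketched as follows. The function names are hypothetical, and the two-branch form of the sigmoid is an implementation choice (not from the text) that avoids floating-point overflow for large negative sums.

```python
import math


def sigmoid(s):
    """Sigmoid transfer function 1 / (1 + exp(-s)), written to avoid overflow."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def neuron_output(weights, w0, inputs):
    """Output f(s) of one neuron with weighted sum s = sum_i w_i x_i + w0."""
    s = w0 + sum(w * x for w, x in zip(weights, inputs))
    return sigmoid(s)


# Hypothetical weights: the weighted sum is exactly zero, so the output is 0.5.
on_boundary = neuron_output([1.0, 1.0], -1.0, [0.5, 0.5])
```

An input for which the weighted sum is zero lies on the separating hyperplane of the neuron and gives the output 0.5; large positive sums give outputs close to 1 and large negative sums outputs close to 0.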

In a so-called learning phase, input vectors $x_i = [x_{i1}, x_{i2}, \ldots, x_{in}]^t$ from a given finite
learning set L are presented to all neurons in the first layer of the neural net. For
each input vector $x_i$ the adaptation values for all the weights in the neural net are
computed separately. The computed value of the adaptation $\Delta w_i$ of all weights in
the neural net for the input vector $x_i$ is such that for each neuron j in the output
layer the square of the difference between some target value $t_j(x_i)$ and its actual output
value $y_j(x_i)$ will become smaller if the adaptation were effected. After the computation
of the adaptations of weights for all inputs $x_i$ the weights are actually adapted by
$\Delta w = \sum_i \Delta w_i$.


Figure 3.1 Outline for learning and testing


Figure 3.2 A two-layer continuous Perceptron


Figure 3.3 The sigmoid transfer function

Multi-layer Perceptrons are frequently used for classification of data into different 
classes. For instance, one can use the Perceptron of Figure 3.2 to separate a collection 
of mushrooms into a class A of mushrooms that are very good medicine for some 
specific illness, and a class B of poisonous mushrooms. Assume the mushrooms look 
almost the same. We want to have a criterion by which we can decide to what class 
some particular mushroom belongs. One dangerous way to classify the mushrooms 
is to eat them, but then the mushrooms are destroyed, so we have to consider another 
criterion. 

Assume the mushrooms of the two classes differ a little bit in the length $x_2$ and in
the thickness $x_1$ of the stem. Assume we are given a small set $D_A$ of measurements
of the length and thickness of mushrooms of class A and another set $D_B$ of
measurements of length and thickness of mushrooms of class B. The values of $x_1$
and $x_2$ for the set $D_A$ are represented by small circles in Figure 3.4 and for set $D_B$
by small boxes. A solution for the classification (not the best one) is given by the
separating lines $L_1$ and $L_2$. All points to the right of $L_1$ and simultaneously to the
left of $L_2$ belong to class A; the points in the remaining area are assumed to represent
elements of class B.

We want the Perceptron of Figure 3.2 to learn to find that classification. For now,
we will not show how learning is performed but we will demonstrate that for some
set of values of the weights $w_{ij}$ in the neural network, the classification is possible by that
Perceptron.


Figure 3.4 Representation of mushrooms by stem thickness ($x_1$) and stem
length ($x_2$)


We want to have $y_3 = 1$ for the elements of class A and $y_3 = 0$ for the elements of
class B. Let the weights of neuron 1 be:

$$w_{11} = 100 \quad w_{12} = 100 \quad w_{10} = -1000$$

The line $L_1$ is now represented by:

$$w_{11} x_1 + w_{12} x_2 + w_{10} = 100x_1 + 100x_2 - 1000 = 0$$

For points on this line we have for the output $y_1$ of neuron 1 that $y_1 = 0.5$. Points
at the right of this line (and not too close to this line) will give an output $y_1 \approx 1$ and
points to the left will give $y_1 \approx 0$. Let the weights of neuron 2 be:

$$w_{21} = 100 \quad w_{22} = -100 \quad w_{20} = 0$$

With these weights we have for all points on the line $L_2$ that

$$w_{21} x_1 + w_{22} x_2 + w_{20} = 0$$

and thus the output $y_2$ of neuron 2 will be 0.5. Points at the right of this line will
give an output $y_2 \approx 1$ and points to the left will give $y_2 \approx 0$.

Thus for the points of class A we have $y_1 \approx 1$ and $y_2 \approx 0$. For elements belonging
to class B we obtain:

$$(y_1, y_2) \approx (0, 0) \text{ or } (0, 1) \text{ or } (1, 1)$$

Let the weights of neuron 3 be:

$$w_{31} = 100 \quad w_{32} = -100 \quad w_{30} = -50$$

Thus the weighted sum of inputs of neuron 3 is:

$$s(x_i) = 100y_1 - 100y_2 - 50$$

One easily verifies that only for the elements of class A will the output be $y_3 \approx 1$; for
the elements of class B we obtain $y_3 \approx 0$.

We see that the classification can be performed by the Perceptron. The set of 
weights is not unique; there are an infinite number of solutions for the same 
classification. 
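
The weights given above can be checked numerically. The following is a minimal sketch, assuming each neuron uses the sigmoid transfer function $f(s) = 1/(1 + e^{-s})$ (consistent with the output 0.5 on the separating lines). The function names and the sample points are hypothetical and chosen only for illustration.

```python
import math

def sigmoid(s):
    """Sigmoid transfer function, written so that large |s| cannot overflow."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)

def classify(x1, x2):
    """Return (y1, y2, y3) for the three-neuron Perceptron with the weights of the text."""
    y1 = sigmoid(100 * x1 + 100 * x2 - 1000)   # neuron 1: line L1
    y2 = sigmoid(100 * x1 - 100 * x2)          # neuron 2: line L2
    y3 = sigmoid(100 * y1 - 100 * y2 - 50)     # neuron 3: y3 ~ 1 only for (y1, y2) ~ (1, 0)
    return y1, y2, y3

# Hypothetical points: (4, 8) lies right of L1 and left of L2 (class A);
# the other three lie in the remaining area (class B).
for point in [(4, 8), (8, 4), (2, 3), (3, 2)]:
    print(point, round(classify(*point)[2]))
```

A point such as (5, 5) lies on both lines and gives $y_1 = y_2 = 0.5$, which illustrates why points close to a separating line are excluded in the text.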

An intriguing question is how the Perceptron can learn from the given data sets
$D_A$ and $D_B$ a correct set of weights. We will deal with that problem in the next chapter.


3.2 The gradient descent adaptation method 


In Figure 3.5 the general structure of a three-layer Perceptron is given. It turns out 
that any task performed by a multi-layer network can always be solved by a three-layer 
Perceptron, so we restrict our description of a multi-layer Perceptron to the structure 
of Figure 3.5. 

For any output neuron we can derive the relation between the output vector $y$
and the input vector $x = [x_1, x_2, \ldots, x_m]^t$.

For the value of the $i$th output in the third layer we have:

$$ y_{3i} = f_{3i}(s_{3i}) $$

with $f_{3i}$ the transfer function of the particular output neuron and $s_{3i}$ the weighted
input of that neuron defined by:

$$ s_{3i} = \sum_j w_{3i,j}\, y_{2j} + w_{3i,0} $$

with $w_{3i,j}$ the weight in the connection between the $i$th neuron in the third layer and
the output $y_{2j}$ of neuron $j$ in the second layer, and $w_{3i,0}$ the threshold weight of the
particular output neuron.

For the value of the $j$th output $y_{2j}$ in the second layer we obtain similarly:

$$ y_{2j} = f_{2j}(s_{2j}) $$

with:

$$ s_{2j} = \sum_k w_{2j,k}\, y_{1k} + w_{2j,0} $$

For the value of the $k$th output $y_{1k}$ in the first layer we have:

$$ y_{1k} = f_{1k}(s_{1k}) $$

For the value of the weighted input of the $k$th neuron in the first layer we have:

$$ s_{1k} = \sum_m w_{1k,m}\, x_m + w_{1k,0} $$

with $x_m$ the $m$th input value.

Figure 3.5 General structure of a three-layer continuous Perceptron

By substitution we obtain from the equations above for the relation between the
$i$th output and the inputs $x_1, x_2, \ldots, x_m$ of the network:

$$ y_{3i} = f_{3i}\Big( \sum_j w_{3i,j}\, f_{2j}\Big( \sum_k w_{2j,k}\, f_{1k}\Big( \sum_m w_{1k,m}\, x_m + w_{1k,0} \Big) + w_{2j,0} \Big) + w_{3i,0} \Big) $$

with the summation over the number of neurons in the particular layer.

By introducing a dummy neuron in layers 2 and 1 with constant transfer functions
$f_{20}(\cdot) = 1$, $f_{10}(\cdot) = 1$ and by adding a constant additional input $x_0 = 1$ to the net, we
can incorporate the thresholds $w_{\cdot,0}$ of the neurons in the weighted input of the
neurons by the terms $w_{3i,0} f_{20}(\cdot)$, $w_{2j,0} f_{10}(\cdot)$ and $w_{1k,0}\, x_0$.

By this convention we can rewrite the last equation as:

$$ y_{3i} = f_{3i}\Big( \sum_{j \ge 0} w_{3i,j}\, f_{2j}\Big( \sum_{k \ge 0} w_{2j,k}\, f_{1k}\Big( \sum_{m \ge 0} w_{1k,m}\, x_m \Big) \Big) \Big) $$
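
The convention of a constant dummy input can be sketched in code as follows. This is only an illustration under stated assumptions: the sigmoid is used as an example transfer function, the helper names `layer` and `forward` are hypothetical, and the weight values are made up. Each weight row stores the threshold weight first, so that it multiplies the constant 1.

```python
import math

def sigmoid(s):
    """Example non-linear transfer function (an assumption of this sketch)."""
    return 1.0 / (1.0 + math.exp(-s))

def layer(weights, inputs, f=sigmoid):
    """Outputs of one layer. Each row of `weights` is [w_0, w_1, ..., w_n];
    w_0 is the threshold weight and multiplies the constant dummy input 1."""
    extended = [1.0] + list(inputs)
    return [f(sum(w * v for w, v in zip(row, extended))) for row in weights]

def forward(w1, w2, w3, x, f=sigmoid):
    """Three-layer Perceptron: x -> y1 -> y2 -> y3."""
    y1 = layer(w1, x, f)
    y2 = layer(w2, y1, f)
    return layer(w3, y2, f)

# Hypothetical weights: 2 inputs, 2 first-layer neurons, 1 second-layer neuron, 1 output.
W1 = [[0.1, 0.2, -0.3], [0.0, 0.5, 0.5]]
W2 = [[-0.2, 1.0, -1.0]]
W3 = [[0.3, 2.0]]
y3 = forward(W1, W2, W3, [1.0, 2.0])
print(y3)
```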


The transfer functions $f_{3i}$, $f_{2j}$ and $f_{1k}$ are in general the same and are non-linear. If
the transfer functions are linear then the whole structure collapses to one single
neuron as expressed in the following theorem.


Theorem 3.1 


A linear Perceptron (i.e. all transfer functions are linear: f (s)= a*s) can be reduced to 
one single-layer linear Perceptron. 


Proof 


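The relation between the output value of the $i$th output neuron and the $n_0$ inputs of
the network is:

$$ y_{3i} = f_{3i}\Big( \sum_{j=1}^{n_2} w_{3i,j}\, f_{2j}\Big( \sum_{k=1}^{n_1} w_{2j,k}\, f_{1k}\Big( \sum_{m=1}^{n_0} w_{1k,m}\, x_m + w_{1k,0} \Big) + w_{2j,0} \Big) + w_{3i,0} \Big) $$

Replacing each transfer function by the scalar factor $a$, we obtain:

$$ y_{3i} = a^3 \sum_{j=1}^{n_2} w_{3i,j} \sum_{k=1}^{n_1} w_{2j,k} \sum_{m=1}^{n_0} w_{1k,m}\, x_m
 + a^3 \sum_{j=1}^{n_2} w_{3i,j} \sum_{k=1}^{n_1} w_{2j,k}\, w_{1k,0}
 + a^2 \sum_{j=1}^{n_2} w_{3i,j}\, w_{2j,0}
 + a\, w_{3i,0} $$

The three last expressions together represent a constant $W_{3i,0}$.
The triple summation in the first expression:

$$ a^3 \sum_{j=1}^{n_2} w_{3i,j} \sum_{k=1}^{n_1} w_{2j,k} \sum_{m=1}^{n_0} w_{1k,m}\, x_m $$

can be replaced by a single summation:

$$ \sum_{m=1}^{n_0} W_{3i,m}\, x_m \qquad \text{with} \qquad W_{3i,m} = a^3 \sum_{j=1}^{n_2} \sum_{k=1}^{n_1} w_{3i,j}\, w_{2j,k}\, w_{1k,m} $$

Thus the relation between the output value of the $i$th output neuron and the $n_0$
inputs of the network becomes:

$$ y_{3i} = \sum_{m=1}^{n_0} W_{3i,m}\, x_m + W_{3i,0} $$

This equation represents a single linear Perceptron. The same holds for the output
values of the other output neurons. This completes the proof. QED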
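
The collapse stated in Theorem 3.1 can also be verified numerically. The sketch below is an illustration only: the helper names and the weight values are hypothetical, and each weight row is assumed to store the threshold weight first. It computes the weights $W_{3i,m}$ and the constant $W_{3i,0}$ from the formulas of the proof and compares the resulting single-layer output with the output of the three-layer linear network.

```python
def linear_layer(weights, inputs, a):
    """One layer with linear transfer function f(s) = a * s.
    Each row of `weights` is [threshold weight, w_1, ..., w_n]."""
    extended = [1.0] + list(inputs)
    return [a * sum(w * v for w, v in zip(row, extended)) for row in weights]

def three_layer(w1, w2, w3, x, a):
    """Output of the three-layer linear Perceptron."""
    return linear_layer(w3, linear_layer(w2, linear_layer(w1, x, a), a), a)

def collapse(w1, w2, w3, a):
    """Rows [W_i0, W_i1, ..., W_in0] of the equivalent single linear layer."""
    n0 = len(w1[0]) - 1
    rows = []
    for w3i in w3:
        # a^3 * sum_j sum_k w_{3i,j} w_{2j,k} w_{1k,m}; m = 0 picks the thresholds w_{1k,0}
        row = [a ** 3 * sum(w3i[j + 1] * w2[j][k + 1] * w1[k][m]
                            for j in range(len(w2)) for k in range(len(w1)))
               for m in range(n0 + 1)]
        # remaining constant terms: a^2 * sum_j w_{3i,j} w_{2j,0} + a * w_{3i,0}
        row[0] += a ** 2 * sum(w3i[j + 1] * w2[j][0] for j in range(len(w2))) + a * w3i[0]
        rows.append(row)
    return rows

def single_layer(rows, x):
    """Plain weighted sum W_i0 + sum_m W_im x_m for every output i."""
    return [row[0] + sum(w * v for w, v in zip(row[1:], x)) for row in rows]

# Hypothetical weights and input, chosen only for illustration.
W1 = [[0.1, 0.2, -0.3], [0.0, 0.5, 0.5]]
W2 = [[-0.2, 1.0, -1.0], [0.4, 0.3, 0.7]]
W3 = [[0.3, 2.0, -1.5]]
A, X = 0.5, [1.0, 2.0]
print(three_layer(W1, W2, W3, X, A), single_layer(collapse(W1, W2, W3, A), X))
```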


Now we will discuss how we have to train the neural network to behave in a
desired way, or in other words how we have to adapt the weights in the neural
network such that after the learning period when we present an input vector $x_i$, the
outputs of the net will be equal to (or approximately equal to) some target vector
$t(x_i)$. This ultimate goal must be represented by some criterion function indicating
the performance of the neural net. In general this will be a function that is increasing
with the error and will only be zero if the error is zero. Thus we have to minimize
such a criterion function during the learning period.

Given some finite learning set $L \subseteq D$ of $N$ pairs $[x_i, t(x_i)]$, let $U$ be the set of input
vectors in $L$. We want to have for each input vector $x_i \in U$ some target output vector
$t(x_i)$. Let $t_j(x_i)$ be the target value of the $j$th output neuron for input vector $x_i$ and
let $y_j(x_i)$ be the actual output of the $j$th neuron in the output layer of the neural
network. We take the following criterion function as an error measure:

$$ E = \frac{1}{N} \sum_{x_i \in U} n(x_i) \sum_{j=1}^{n_3} [t_j(x_i) - y_j(x_i)]^2 $$

with $N$ the number of samples in the learning set $L$, $n(x_i)$ the number of times input
$x_i$ occurs in $U$ and $n_3$ the number of output neurons in the third layer.

We will call this error measure the mean squared error (MSE). Because $N$ is a given
constant whereas in many applications $n(x_i) = 1$, we can eliminate both from the
equation above. To simplify the following expression for the derivative and to conform
to the expression for the error usually given in literature, we add a factor 1/2. We
obtain in that case for the MSE:

$$ E = \frac{1}{2} \sum_{x_i \in U} \sum_{j=1}^{n_3} [t_j(x_i) - y_j(x_i)]^2 $$


This expression for the error is frequently called the energy or error function of the net.
The error $E$ is a function of all weights in the net. For a given finite learning set
$L$ and some initial random distribution of the weights the MSE will in general be
large. We want to change, during a so-called learning period, the weights step by step,
such that $E$ will decrease to its global minimum. Only when the structure of the
neural net is able to realize exactly the learning set $L$ will the minimum of $E$ be
zero.

For an infinitesimal increment $\Delta w_i$ of the weight $w_i$ in the connection of some
neuron, somewhere in the net the following holds:

$$ \Delta E = \frac{\partial E}{\partial w_i}\, \Delta w_i $$

Thus if we take $\Delta w_i$ proportional to $-(\partial E/\partial w_i)$, then $\Delta E$ will be negative as long as
$\Delta w_i$ is small enough. In Figure 3.6 we have illustrated this adaptation $\Delta w_i$ graphically.
In Figure 3.7 we see that the successive values of $E$ may even increase if $\Delta w_i$ is not
small enough.





Figure 3.6 The gradient of the error function


Figure 3.7 The influence of large values of $\Delta w_i$


We can recapitulate our theory in the following theorem:


Theorem 3.2a


The MSE of a multi-layer neural network will decrease if a weight $w_i$ is changed
according to:

$$ \Delta w_i = -\varepsilon\, \frac{\partial E}{\partial w_i} \qquad \text{for some } \varepsilon > 0 \text{ sufficiently small} $$


Or in a more general form:


Theorem 3.2b


The MSE of a multi-layer neural network will decrease if the weight vector $w$ is
changed according to:

$$ \Delta w = -\varepsilon\, \nabla E(w) \qquad \text{for some } \varepsilon > 0 \text{ sufficiently small} $$

with $\nabla E(w)$ the gradient vector with respect to $w$.


The proportionality constant $\varepsilon$ is called the learning rate. In general it will be
profitable to change the learning rate $\varepsilon$ during the learning process in a well-defined
manner. We will return to this subject later.
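
The role of the phrase 'sufficiently small' can be illustrated with a one-weight toy problem. The sketch below assumes a hypothetical error function $E(w) = \frac{1}{2}(w - 1)^2$, which is not taken from the text; it only serves to show that the same update rule $\Delta w = -\varepsilon\, dE/dw$ lowers $E$ for a small learning rate and raises it for a large one.

```python
def error(w):
    """Hypothetical one-weight error function with its minimum at w = 1."""
    return 0.5 * (w - 1.0) ** 2

def gradient(w):
    """Derivative dE/dw of the error function above."""
    return w - 1.0

def descend(w, eps, steps):
    """Apply w <- w - eps * dE/dw repeatedly and record the error after every step."""
    history = [error(w)]
    for _ in range(steps):
        w = w - eps * gradient(w)
        history.append(error(w))
    return history

small = descend(3.0, 0.1, 5)   # error shrinks at every step
large = descend(3.0, 2.5, 5)   # every step overshoots the minimum and the error grows
print(small)
print(large)
```

For this particular quadratic each step multiplies the distance to the minimum by $(1 - \varepsilon)$, so the error decreases only for $0 < \varepsilon < 2$; for other error functions the admissible range is different.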

The learning rule prescribed by the theorem above is frequently called the learning
rule based on the gradient descent method of the MSE. The theorem does not imply
that one will ultimately reach the global minimum of the MSE; it may happen that
one ends up in a local minimum of the MSE function (see Figure 3.8).











Figure 3.8 A global and local minimum of the error function


Because the expression for the MSE is given by:

$$ E = \frac{1}{2} \sum_{x_k \in U} \sum_{j=1}^{n_3} [t_j(x_k) - y_j(x_k)]^2 $$

we can rewrite the learning rule as:

$$ \Delta w_i = \varepsilon \sum_{x_k \in U} \sum_{j=1}^{n_3} [t_j(x_k) - y_j(x_k)]\, \frac{\partial y_j(x_k)}{\partial w_i} $$

To calculate the adaptation of some weight $w_i$ we have to evaluate the error
$t_j(x_k) - y_j(x_k)$ and the derivative $\partial y_j/\partial w_i$ for all output neurons and for all $x_k$ of the
learning set $L$. The calculation of $\partial y_j/\partial w_i$ depends on the structure of the neural
network and will be discussed in the following sections.

If we adapt some weight by $\sum \Delta w_i$ after the calculation of $\Delta w_i$ for all $x_k$ of the
learning set $L$, then we call this procedure global learning (or batch training). The
number of times the weights are adapted for the whole set $L$ is called the global
learning time $c$.

Sometimes the adaptation of a weight $w_i$ is executed (in conflict with our theory)
after each element $x_k$ of $L$ separately. We will call this procedure local learning (or
incremental training).

Because global learning with a large learning set $L$ can take a considerable amount
of time, one frequently uses the so-called staged training procedure: one adapts the
weights at each step, $c_i$ times (called cycle time), for a subset $S_i \subset L$ such that
$S_1 \subset S_2 \subset S_3 \subset \cdots \subset S_n = L$. In that case the global learning time is equal to $c = \sum_i c_i$.
Staged training is a valid procedure for minimizing the MSE because it ultimately
results in $S_n = L$, as required by the derived learning rule.


3.3 Learning with a single-neuron continuous Perceptron 


Although only small-scale problems can be solved with a continuous single-neuron 
Perceptron, the analysis of its behaviour will provide us with some understanding of 
the properties of the fundamental building unit of the more general multi-layer 
Perceptron. 

In this section we will determine the adaptation of the weights of the single neuron 
based on the gradient descent of the MSE as discussed in the previous section. 

The output of a single-neuron Perceptron (see Figure 3.9), with some transfer
function $f$, for an input vector $x_i$ is equal to:

$$ g_w(x_i) = f[s(x_i)] = f\Big( \sum_{j=0}^{n} w_j x_{ij} \Big) \qquad \text{with } x_{i0} = 1 \text{ and } -w_0 \text{ the threshold} $$

Figure 3.9 A single-neuron continuous Perceptron

Assume we want to realize with the single-neuron Perceptron for input vector $x_i$
a target value $t(x_i)$ with $x_i = [x_{i1}, x_{i2}, \ldots, x_{in}]^t \in U$ with $U$ a finite subset of $R^n$, i.e.
$[x_i, t(x_i)] \in L$, the learning set.
For the error function we have:

$$ E = \frac{1}{2} \sum_{x_i \in U} [t(x_i) - g_w(x_i)]^2 $$

For the adaptation of some weight $w_j$ we obtain, according to Theorem 3.2:

$$ \Delta w_j = -\varepsilon\, \frac{\partial E}{\partial w_j} = \varepsilon \sum_{x_i \in U} [t(x_i) - g_w(x_i)]\, \frac{df}{ds}\, x_{ij} $$

In the following we will distinguish the extended weight vector $\hat{w} = [w_0, w_1, \ldots, w_n]^t$
from the weight vector $w = [w_1, w_2, \ldots, w_n]^t$. In the same way we distinguish the
extended input vector $\hat{x}_i = [1, x_{i1}, x_{i2}, \ldots, x_{in}]^t$ and the input vector $x_i = [x_{i1}, x_{i2}, \ldots, x_{in}]^t$.

Because for a given input vector $x_i$ the value of $(t(x_i) - g_w(x_i))$ and of the derivative
$df/ds$ is independent of $j$, we obtain with respect to the incremental vector $\Delta\hat{w}$ of the
weight vector $\hat{w} = [w_0, w_1, w_2, \ldots, w_n]^t$ the following theorem:


Theorem 3.3


In order to minimize the MSE for a given set of target values $t(x_i)$ for a given finite
set $U$, the adaptation of the weight vector $\hat{w}$ of a single-neuron Perceptron must be:

$$ \Delta\hat{w} = \varepsilon \sum_{x_i \in U} [t(x_i) - g_w(x_i)]\, \frac{df}{ds}\, \hat{x}_i $$

We see that, for a single input vector $x_i$, $|\Delta\hat{w}|$ is proportional to $|\hat{x}_i|$ and in the direction
of the extended input vector $\hat{x}_i$ if $t(x_i) > g(x_i)$ and in the opposite direction of $\hat{x}_i$ if
$t(x_i) < g(x_i)$. The same holds for the relation between $\Delta w$ and $x_i$ in the non-extended
input space. For the case of three-dimensional extended input space, see Figure 3.10.

Figure 3.10 The extended input/weight space. The weight vector adaptation
$\Delta\hat{w}$ is proportional to the input vector $\hat{x}$


Example 3.1 


Take a single-neuron Perceptron with two inputs $x_1$ and $x_2$ and a threshold $w_0$. For
the sake of simplicity we take a linear transfer function for the neuron, defined by
$f(s) = s$, with $s = w_0 + w_1 x_1 + w_2 x_2$. Assume $U = \{[1, 1]^t, [1, -1]^t, [-1, 1]^t\}$.

The target values are $t([1, 1]^t) = 0.9$, $t([1, -1]^t) = 0.1$ and $t([-1, 1]^t) = 0.1$.
Assume the initial weights are: $w_1 = 0.5$, $w_2 = 0.5$ and $w_0 = 0$.

If we present the elements of $U$ to the net we obtain:

$$ x = [1, 1]^t \Rightarrow s(x) = 0.5 + 0.5 + 0 = 1 \Rightarrow g(x) = 1 \Rightarrow t(x) - g(x) = 0.9 - 1 = -0.1 $$
$$ x = [1, -1]^t \Rightarrow s(x) = 0.5 - 0.5 + 0 = 0 \Rightarrow g(x) = 0 \Rightarrow t(x) - g(x) = 0.1 - 0 = +0.1 $$
$$ x = [-1, 1]^t \Rightarrow s(x) = -0.5 + 0.5 + 0 = 0 \Rightarrow g(x) = 0 \Rightarrow t(x) - g(x) = 0.1 - 0 = +0.1 $$

The adaptation of weights is prescribed by:

$$ \Delta\hat{w} = \varepsilon \sum_{x_i \in U} [t(x_i) - g(x_i)]\, \frac{df}{ds}\, \hat{x}_i $$

Because for all $s(x_i)$: $df/ds = 1$ we have:

$$ \begin{bmatrix} \Delta w_0 \\ \Delta w_1 \\ \Delta w_2 \end{bmatrix}
 = \varepsilon(-0.1) \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}
 + \varepsilon(0.1) \begin{bmatrix} 1 \\ 1 \\ -1 \end{bmatrix}
 + \varepsilon(0.1) \begin{bmatrix} 1 \\ -1 \\ 1 \end{bmatrix}
 = \varepsilon \begin{bmatrix} 0.1 \\ -0.1 \\ -0.1 \end{bmatrix} $$

We have stipulated that $\varepsilon$ must be small enough (e.g. $\varepsilon = 0.01$) to guarantee the
decline in $E$. Incidentally in this simple linear case we can take $\varepsilon = 1$ and after
adaptation we will obtain the values of the weights for which the MSE is zero.

For this simple case we can easily calculate without learning the values of the three
weights such that the MSE is zero. For a zero MSE we have:

$$ \text{for } x = [1, 1]^t: \quad t(x) = g(x) = 0.9 = s(x) = +w_1 + w_2 + w_0 $$
$$ \text{for } x = [1, -1]^t: \quad t(x) = g(x) = 0.1 = s(x) = +w_1 - w_2 + w_0 $$
$$ \text{for } x = [-1, 1]^t: \quad t(x) = g(x) = 0.1 = s(x) = -w_1 + w_2 + w_0 $$

It follows that the error is zero for $w_1 = 0.4$, $w_2 = 0.4$ and $w_0 = 0.1$.
In Figure 3.11, a three-dimensional plot for the error $E$ as a function of $w_1$ and
$w_2$ is given with $w_0 = 0.1$. Figure 3.12 shows the contour lines of constant error levels.


According to the learning rule, we adapt the weights after calculating the error for 
the whole set U of input vectors of the learning set L. We have called this procedure 
global learning. 
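
The single global adaptation of Example 3.1 can be reproduced with a few lines of code. This is a minimal sketch for the linear neuron $f(s) = s$ of the example; the function names are hypothetical, and the weights are stored in the order $[w_0, w_1, w_2]$.

```python
SAMPLES = [((1, 1), 0.9), ((1, -1), 0.1), ((-1, 1), 0.1)]   # (input vector, target)

def output(w, x):
    """Linear neuron: g(x) = s(x) = w0 + w1*x1 + w2*x2."""
    return w[0] + w[1] * x[0] + w[2] * x[1]

def batch_step(w, samples, eps):
    """One global (batch) adaptation: sum the contributions of all samples first."""
    delta = [0.0, 0.0, 0.0]
    for x, t in samples:
        extended = (1.0, x[0], x[1])
        err = t - output(w, x)
        for k in range(3):
            delta[k] += eps * err * extended[k]   # df/ds = 1 for f(s) = s
    return [wk + dk for wk, dk in zip(w, delta)]

def mse(w, samples):
    """Error measure E = 1/2 * sum of squared errors, as used in the text."""
    return 0.5 * sum((t - output(w, x)) ** 2 for x, t in samples)

w_new = batch_step([0.0, 0.5, 0.5], SAMPLES, eps=1.0)
print(w_new, mse(w_new, SAMPLES))
```

With $\varepsilon = 1$ the step lands on $w_0 = 0.1$, $w_1 = 0.4$, $w_2 = 0.4$ (up to floating-point rounding), the zero-error weights derived in the example.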





Figure 3.11 The error function with $w_0 = 0.1$ of Example 3.1

Figure 3.12 The constant-level contour lines of the error function of
Figure 3.11


What will happen if we adapt the weights after the calculation of the error for each
element of $U$ separately? This method of learning is called local learning, and the
following example shows that, in general, global and local learning differ after a finite
number of adaptations (globally one step and locally three steps). In general, after a
finite number of steps in case of local learning, the MSE will depend on the particular
sequence of elements of $U$.


Example 3.2 
Take the previous example with the same initial distribution of weights: $w_1 = 0.5$,
$w_2 = 0.5$ and $w_0 = 0$.

We calculate first the error for $x = [1, 1]^t$:

$$ x = [1, 1]^t \Rightarrow s(x) = 0.5 + 0.5 + 0 = 1 \Rightarrow g(x) = 1 \Rightarrow t(x) - g(x) = 0.9 - 1 = -0.1 $$

We take $\varepsilon = 1$.
According to the learning rule the weights become $w_1 = 0.4$, $w_2 = 0.4$ and $w_0 = -0.1$.
For the new set of weights we calculate the error for $x = [1, -1]^t$:

$$ x = [1, -1]^t \Rightarrow s(x) = 0.4 - 0.4 - 0.1 = -0.1 \Rightarrow g(x) = -0.1 \Rightarrow t(x) - g(x) = 0.1 - (-0.1) = 0.2 $$

After adaptation the weights will be $w_1 = 0.6$, $w_2 = 0.2$ and $w_0 = 0.1$.

We calculate next the error for $x = [-1, 1]^t$:

$$ x = [-1, 1]^t \Rightarrow s(x) = -0.6 + 0.2 + 0.1 = -0.3 \Rightarrow g(x) = -0.3 \Rightarrow t(x) - g(x) = 0.1 - (-0.3) = 0.4 $$

After adaptation the weights will be $w_1 = 0.2$, $w_2 = 0.6$ and $w_0 = 0.5$.

Thus the final weights are $w_1 = 0.2$, $w_2 = 0.6$ and $w_0 = 0.5$. The MSE in this case is
not zero.
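
The three local adaptations of Example 3.2 can be traced in the same way. The sketch below is an illustration with hypothetical function names; it uses the linear neuron of the example, stores the weights as $[w_0, w_1, w_2]$, and adapts them immediately after each sample.

```python
SAMPLES = [((1, 1), 0.9), ((1, -1), 0.1), ((-1, 1), 0.1)]   # presented in this order

def output(w, x):
    """Linear neuron: g(x) = s(x) = w0 + w1*x1 + w2*x2."""
    return w[0] + w[1] * x[0] + w[2] * x[1]

def local_pass(w, samples, eps):
    """Local (incremental) learning: adapt the weights after every single sample."""
    w = list(w)
    trace = []
    for x, t in samples:
        extended = (1.0, x[0], x[1])
        err = t - output(w, x)                 # error with the current weights
        w = [wk + eps * err * xk for wk, xk in zip(w, extended)]
        trace.append(list(w))
    return w, trace

def mse(w, samples):
    """Error measure E = 1/2 * sum of squared errors."""
    return 0.5 * sum((t - output(w, x)) ** 2 for x, t in samples)

w_final, trace = local_pass([0.0, 0.5, 0.5], SAMPLES, eps=1.0)
print(trace)
print(w_final, mse(w_final, SAMPLES))
```

The final weights agree with the example, and the remaining error is clearly non-zero, in contrast to the single global step of Example 3.1.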


In general the MSE will have several minima, and depending on the initial 
distribution of weights, one may end up in some local minimum instead of the global 
minimum. It turns out that with local learning one can frequently avoid local minima. 


3.4 The exact fitting of the data set with a single-neuron 
Perceptron 


In this section we investigate whether the function realized after learning by a single
continuous Perceptron can go exactly through the samples of the given data set
$D = \{[x_i, t(x_i)]\}$.

The following theorem will show that only under very restricted conditions can a
function $g_w(x)$ that is realizable by a single-neuron Perceptron go exactly through
the samples of the data set.


Theorem 3.4 


A function $g_w$: $R^n \to R$ that can be realized by a single-neuron Perceptron with transfer
function $f$ can go exactly (the MSE is zero) through the samples of the data set
$[x_i, t(x_i)] \in D$, iff for every element $[x_i, t(x_i)]$ from $D$ the vector:

$$ \tilde{x}_i = \{\hat{x}_i, f^{-1}[t(x_i)]\} \qquad \text{with } \hat{x}_i = [1, x_{i1}, x_{i2}, \ldots, x_{in}] $$

is a linear combination of some unique set of $n+1$ linearly independent vectors:

$$ \tilde{x}_j = \{\hat{x}_j, f^{-1}[t(x_j)]\} $$

obtained from $n+1$ elements $[x_j, t(x_j)]$ from the data set $D$.


Proof


Let the data set $D$ contain a subset $S$ with $n+1$ pairs $[x_j, t(x_j)]$ such that we can
form $n+1$ independent equations:

$$ \sum_{k=0}^{n} w_k x_{jk} = f^{-1}[t(x_j)] \qquad \text{for each element in } S \text{ (NB } x_{j0} = 1) $$

From these equations all $n+1$ weights and thus $g_w$, which exactly goes through the
samples of the subset $S$, can be determined.
If in addition for every element $[x_i, t(x_i)]$ of the data set $D$ we can write:

$$ \{\hat{x}_i, f^{-1}[t(x_i)]\} = \Big\{ \sum_{j=1}^{n+1} \alpha_j \hat{x}_j,\; \sum_{j=1}^{n+1} \alpha_j f^{-1}[t(x_j)] \Big\} \qquad \text{with } [x_j, t(x_j)] \in S $$

and not all $\alpha_j = 0$, then for every element $[x_i, t(x_i)] \in D$ the following relation should hold:

$$ g_w(x_i) = f\Big( \sum_{k=0}^{n} w_k x_{ik} \Big)
 = f\Big( \sum_{k=0}^{n} w_k \sum_{j=1}^{n+1} \alpha_j x_{jk} \Big)
 = f\Big( \sum_{j=1}^{n+1} \alpha_j \sum_{k=0}^{n} w_k x_{jk} \Big) $$
$$ \phantom{g_w(x_i)} = f\Big( \sum_{j=1}^{n+1} \alpha_j f^{-1}[t(x_j)] \Big) = f\{f^{-1}[t(x_i)]\} = t(x_i) $$

Thus the function $g_w$ also goes exactly through all the samples of $D$. If, on the other
hand, the data set should contain more than one subset $S$ with the property mentioned
above we would find a different function $g_w$ for each such set and hence no unique
solution for $g_w$ exists. QED


Example 3.3 


Take a single-neuron Perceptron with two inputs $x_1$ and $x_2$ and a threshold $w_0$. The
transfer function of the neuron is defined by:

$$ f(s) = s \qquad \text{with } s = w_0 + w_1 x_1 + w_2 x_2 $$

Assume $U = \{x_1, x_2, x_3, x_4\} = \{(1, 1), (1, -1), (-1, 1), (-1, 3)\}$.
The target values are: $t(1, 1) = 0.9$, $t(1, -1) = 0.1$, $t(-1, 1) = 0.1$ and $t(-1, 3) = 0.9$.
The first three extended input vectors $\hat{x}_i$ are linearly independent, thus we can
calculate the required values of the three weights.
For a zero MSE we must have:

$$ \text{for } x = (1, 1): \quad t(x) = g(x) = 0.9 = s(x) = w_0 + w_1 + w_2 $$
$$ \text{for } x = (1, -1): \quad t(x) = g(x) = 0.1 = s(x) = w_0 + w_1 - w_2 $$
$$ \text{for } x = (-1, 1): \quad t(x) = g(x) = 0.1 = s(x) = w_0 - w_1 + w_2 $$

Thus $w_0 = 0.1$, $w_1 = 0.4$ and $w_2 = 0.4$.
For the fourth sample we have that $[\hat{x}_4, f^{-1}(t(x_4))]$ is a linear combination of
$[\hat{x}_1, f^{-1}(t(x_1))]$, $[\hat{x}_2, f^{-1}(t(x_2))]$ and $[\hat{x}_3, f^{-1}(t(x_3))]$:

$$ [1, -1, 3, 0.9] = [1, 1, 1, 0.9] - [1, 1, -1, 0.1] + [1, -1, 1, 0.1] $$

thus the function $g_w(x)$ also goes through the fourth sample.
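
The calculation of Example 3.3 amounts to solving three linear equations and then checking the fourth sample. The following sketch does this with a small Gauss-Jordan elimination routine; the routine and all names are illustrative choices, not part of the text, and the weights are ordered $[w_0, w_1, w_2]$.

```python
def solve(a, b):
    """Solve the square linear system a * w = b by Gauss-Jordan elimination
    with partial pivoting. `a` is a list of rows, `b` the right-hand side."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

# Extended input vectors [1, x1, x2] of the first three samples and their targets.
# Because f(s) = s, the inverse transfer function leaves the targets unchanged.
X_HAT = [[1, 1, 1], [1, 1, -1], [1, -1, 1]]
TARGETS = [0.9, 0.1, 0.1]

w = solve(X_HAT, TARGETS)                      # [w0, w1, w2]
g_fourth = w[0] + w[1] * (-1) + w[2] * 3       # output for the fourth sample (-1, 3)
print(w, g_fourth)
```

The fourth output equals its target 0.9 (up to rounding) only because the fourth extended vector is a linear combination of the first three, as required by the theorem.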


Example 3.4

It will be clear that we can exactly fit the data points when the data points belong
to the function $g_w(x)$ realizable by a single-neuron Perceptron. For example, given
three data points $D = \{[x_1, t(x_1)], [x_2, t(x_2)], [x_3, t(x_3)]\}$ with $t(x) = g_w(x) =
f(w_0 + w_1 x)$ for some $w_0$ and some $w_1$, we can then determine from two examples
$w_0$ and $w_1$ and the third example will satisfy the condition mentioned in Theorem 3.4.

From Theorem 3.4 and its proof we can conclude that we can calculate the weights
if there exists a function $g_w(x)$ that goes exactly through all samples of the data set
$D$, by selecting $n+1$ independent input vectors.


Practical statement 3.1 


If a function realizable by a single-neuron Perceptron can go exactly through the 
samples of the data set, then we do not need to learn that Perceptron, because we 
can calculate beforehand the required weights from the data set D. 


Another consequence of Theorem 3.4 is the following statement:


Practical statement 3.2


If $x$ in $[x, t(x)] \in D$ is of dimension $n$ and the data set $D$ contains $p \le n+1$ independent
extended input vectors $\hat{x}_i$, then the function $g_w$: $R^n \to R$ realized by the single-neuron
Perceptron can go exactly through all the samples of the data set $D$.

If $p = n+1$, the function $g_w(x)$ is unique. If $p < n+1$, then there is an infinite number
of solutions for $g_w(x)$.


The last sentence in the statement above reveals that generalization from a data set
with a neural net can be dangerous.

Because any ‘non-academic’ data set $D$ does not fulfil the requirements of
Theorem 3.4 we also have the following statement:


Practical statement 3.3


If $x$ in $[x, t(x)] \in D$ is of dimension $n$, and the data set $D$ contains more than $n+1$
elements, then the function $g_w(x)$ realized by the single-neuron Perceptron cannot go
exactly through all the samples of the data set $D$.


3.5 The approximate fitting of the data set with a
single-neuron Perceptron


In the previous section we found that in most practical cases no function
realizable with a single-neuron Perceptron can go exactly through the samples of
the data set. However, we can always find with a single-neuron Perceptron a function
$g_w$ such that $g_w(x)$ is ‘approximately’ equal to the targets $t(x)$ for all elements in the
data set $D$. The quality of the ‘approximation’ is measured by the MSE.


Example 3.5 


Assume the transfer function of the neuron is given by $f(s) = [1 + \exp(-s)]^{-1}$ and
$s = w_0 x_0 + w_1 x_1$, with $x_0 = 1$. Thus there is one external input $x_1$. Let $D = \{(x_i, t(x_i))\} =
\{(1, 0.1), (-1, 0.1), (2, 0.4), (-2, 0.4), (3, 0.9), (-3, 0.9)\}$ be the data set. [NB: ten times
$t(x_i)$ gives the square of $x_i$.] The MSE depends on the value of $w_0$ and $w_1$. See
Figure 3.13.

The minimum of the MSE occurs at $w_0 = -0.13$ and $w_1 = 0$. Thus the function to
be realized after learning by the single-neuron Perceptron is the constant:
$g_w(x) = [1 + \exp(0.13)]^{-1} = 0.47$ (see Figure 3.14); MSE $= 0.327$.

One might say that the function $g_w(x) = 0.47$, which will be learned by the
single-neuron Perceptron, is a bad approximation of the function $g(x) = 0.1x^2$, which
one might presume to be the function underlying the data set $D$. However, if the
function $h(x)$ drawn in Figure 3.13 was the generator of the data set $D$, then $g_w(x) = 0.47$
would not be a bad approximation at all. So without any additional information,
apart from the value of the MSE, one cannot qualify the correctness of the function
learned by the neural net.

Figure 3.13 The error function of Example 3.5

Figure 3.14 The constant function $g_w(x)$ with a minimal error for Example 3.5
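
The numbers quoted in Example 3.5 can be checked directly. The sketch below evaluates the error $E = \frac{1}{2}\sum_i [t(x_i) - g_w(x_i)]^2$ for the sigmoid neuron of the example. It relies on one added observation that is not in the text: when $w_1 = 0$ the output is a constant, and the constant with the smallest squared error is the mean of the targets, so $w_0$ follows from inverting the sigmoid at that mean. The names are hypothetical.

```python
import math

DATA = [(1, 0.1), (-1, 0.1), (2, 0.4), (-2, 0.4), (3, 0.9), (-3, 0.9)]

def g(x, w0, w1):
    """Sigmoid neuron with one external input."""
    return 1.0 / (1.0 + math.exp(-(w0 + w1 * x)))

def error(w0, w1):
    """E = 1/2 * sum of squared errors over the data set."""
    return 0.5 * sum((t - g(x, w0, w1)) ** 2 for x, t in DATA)

mean_t = sum(t for _, t in DATA) / len(DATA)    # best constant output
w0_best = math.log(mean_t / (1.0 - mean_t))     # inverse sigmoid of the mean

print(round(w0_best, 2), round(mean_t, 2), round(error(w0_best, 0.0), 3))
# Tilting the sigmoid in either direction raises the error for this symmetric data set:
print(error(w0_best, 0.1) > error(w0_best, 0.0), error(w0_best, -0.1) > error(w0_best, 0.0))
```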


After learning, we only know that the function realized by the single-neuron 
Perceptron will approximate the given data set as closely as possible (measured by 
the MSE), as long as we do not end up in a local minimum of the MSE. 

In many cases, however, we do not have to use the learning algorithm to find the
desired function $g_w$ approximating the data set $D$, because we can determine the
pertinent weight vector $\hat{w}$ beforehand.


Theorem 3.5 


Given a single-neuron Perceptron that can realize functions $g_w$: $R^n \to R$, then the
weight vector $w$ of the function $g_w$: $R^n \to R$ with a minimal MSE for a given data set
$D$ can be determined from a set of linear equations if:

1. The ‘error’ $\Delta g_w(x_i) = g_w(x_i) - t(x_i)$ for all $x_i$ of $D$ is ‘small’.
2. The derivative $\partial E/\partial w = 0$ only for its global minimum.
3. The transfer function $f$ of the neuron has an inverse $f^{-1}$.


Proof


Let

$$ s(x_i) = \sum_{k=0}^{n} w_k x_{ik} $$

be the weighted input of the neuron. In case of the minimal MSE we have for each
$[x_i, t(x_i)] \in D$: $g_w(x_i) = f[s(x_i)]$, and the ‘error’ $\Delta g_w(x_i) = g_w(x_i) - t(x_i)$ in the output of
the neuron for input $x_i$ is assumed to be ‘small’. (For simplicity we will from now on
drop the subscript $w$ of $g_w$.)

If there were no error, the weighted input for the input vector $x_i$ would be
$s(x_i) = f^{-1}\{t(x_i)\}$; however, $t(x_i) \ne g(x_i)$, thus there will be an error $\Delta s(x_i)$ in the weighted
input. Thus we can write:

$$ s(x_i) = f^{-1}\{t(x_i)\} + \Delta s(x_i) $$

If $\Delta s(x_i)$ is small (or higher derivatives of $f(s)$ are negligible) we may replace the
equation above by (see also Figure 3.15):

$$ f[s(x_i)] = t(x_i) + \frac{df}{ds}\, \Delta s(x_i) $$

On the other hand we have:

$$ f[s(x_i)] = t(x_i) + \Delta g(x_i) $$

Thus:

$$ \Delta s(x_i) = \Big( \frac{df}{ds} \Big)^{-1} \Delta g(x_i) $$

Figure 3.15 The graphical representation of the relation between $\Delta s(x_i)$ and
$\Delta g_w(x_i)$

For each of the elements $[x_i, t(x_i)] \in D$ we now have a linear equation:

$$ \sum_{k=0}^{n} w_k x_{ik} = f^{-1}\{t(x_i)\} + \Delta s(x_i) = f^{-1}\{t(x_i)\} + \Big( \frac{df}{ds} \Big)^{-1} \Delta g(x_i) $$

with the derivative $df/ds$ evaluated for $f^{-1}\{t(x_i)\}$.

For the $N$ elements of $D$ we can find $N$ such independent equations, thus we also
have $N$ unknown values of $\Delta g(x_i)$ and $n+1$ unknown values of weights $w_k$. However,
because the MSE must be minimal we also have $\partial E/\partial w_j = 0$ for each weight $w_j$.

With:

$$ E = \frac{1}{2} \sum_{x_i \in D} \{t(x_i) - g(x_i)\}^2 $$

$$ g(x_i) = f(s(x_i)) = f\Big( \sum_{k} w_k x_{ik} \Big) $$

we obtain for each weight $w_j$ an additional equation:

$$ \frac{\partial E}{\partial w_j} = \sum_{x_i \in D} \Delta g(x_i)\, \frac{df}{ds}\, x_{ij} = 0 $$

The derivative $df/ds$ should be evaluated at $s(x_i) = f^{-1}(g(x_i))$; for small values of
$\Delta g(x_i)$, however, we may evaluate the derivative at $f^{-1}[t(x_i)]$. We now have $N+n+1$
unknowns and the same number of linear equations, and so we can determine the
weight vector $w$ and hence $g_w$ and also the minimum MSE. QED


Example 3.6


According to Theorem 3.4 there is no function that is realizable by a single-neuron
Perceptron going through the samples of the following data set:

$$ D = \{(x_i, t(x_i))\} = \{([1, 1]^t, 0.9), ([1, -1]^t, 0.1), ([-1, 1]^t, 0.1), ([-1, -1]^t, 0.1)\} $$

However, the data set can be approximated by a function $g_w$ to be learned by the
neural net and we can calculate the desired weights.

For simplicity we take a linear transfer function $f(s) = s$. Because in this case
$f^{-1}(t(x_i)) = t(x_i)$ and $df/ds = 1$, we have $\Delta s(x_i) = \Delta g(x_i)$. Now the following relations
must hold:

$$ x_1 = [1, 1]^t \Rightarrow s(x_1) = w_0 + w_1 + w_2 = 0.9 + \Delta g(x_1) $$
$$ x_2 = [1, -1]^t \Rightarrow s(x_2) = w_0 + w_1 - w_2 = 0.1 + \Delta g(x_2) $$
$$ x_3 = [-1, 1]^t \Rightarrow s(x_3) = w_0 - w_1 + w_2 = 0.1 + \Delta g(x_3) $$
$$ x_4 = [-1, -1]^t \Rightarrow s(x_4) = w_0 - w_1 - w_2 = 0.1 + \Delta g(x_4) $$

For the minimal MSE we must have $\partial E/\partial w_j = 0$ for each weight $w_j$. Thus for each
$j \in \{0, 1, 2\}$:

$$ \sum_{i} \Delta g(x_i)\, \frac{df}{ds}\, x_{ij} = 0 \qquad \Big( \frac{df}{ds} = 1 \text{ in our case and } x_{i0} = 1 \Big) $$

we obtain the following relations:

$$ \Delta g(x_1)(1) + \Delta g(x_2)(1) + \Delta g(x_3)(1) + \Delta g(x_4)(1) = 0 $$
$$ \Delta g(x_1)(1) + \Delta g(x_2)(1) + \Delta g(x_3)(-1) + \Delta g(x_4)(-1) = 0 $$
$$ \Delta g(x_1)(1) + \Delta g(x_2)(-1) + \Delta g(x_3)(1) + \Delta g(x_4)(-1) = 0 $$

From this set of seven equations we conclude:

$$ \Delta g(x_1) = -0.2 \qquad \Delta g(x_2) = 0.2 \qquad \Delta g(x_3) = 0.2 \qquad \Delta g(x_4) = -0.2 $$
$$ w_1 = 0.2 \qquad w_2 = 0.2 \qquad w_0 = 0.3 $$
$$ \text{MSE} = 0.08 $$
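
For the linear neuron of Example 3.6 the seven equations can be reduced to three. Substituting $\Delta g(x_i) = \sum_k w_k x_{ik} - t(x_i)$ into the three conditions $\sum_i \Delta g(x_i)\, x_{ij} = 0$ gives $\sum_i \big(\sum_k w_k x_{ik} - t(x_i)\big) x_{ij} = 0$ for $j = 0, 1, 2$. The sketch below solves this reduced system; the elimination routine and all names are illustrative choices, not part of the text.

```python
def solve(a, b):
    """Solve the square linear system a * w = b by Gauss-Jordan elimination
    with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

SAMPLES = [((1, 1), 0.9), ((1, -1), 0.1), ((-1, 1), 0.1), ((-1, -1), 0.1)]
X_HAT = [(1.0, x1, x2) for (x1, x2), _ in SAMPLES]   # extended input vectors
T = [t for _, t in SAMPLES]

# Reduced system: sum_i x_ij * x_ik * w_k = sum_i x_ij * t_i for j = 0, 1, 2.
A = [[sum(x[j] * x[k] for x in X_HAT) for k in range(3)] for j in range(3)]
B = [sum(x[j] * t for x, t in zip(X_HAT, T)) for j in range(3)]

w = solve(A, B)                                                   # [w0, w1, w2]
dg = [sum(wk * xk for wk, xk in zip(w, x)) - t for x, t in zip(X_HAT, T)]
mse = 0.5 * sum(d * d for d in dg)
print(w, dg, mse)
```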


The previous example was a demonstration of Theorem 3.5 for the simple case of 
a linear transfer function for the neuron. The following example shows the application 
of Theorem 3.5 in case of a non-linear transfer function. If the reader is convinced 
that the theorem holds for a non-linear transfer function he or she may skip this 
example. 


Example 3.7 
Assume the transfer function of the neuron is given by:

$$ f(s) = 1 - e^{-s^2} $$

and $s = w_0 x_0 + w_1 x_1$, with $x_0 = 1$. Thus there is one external input $x_1$. Let
$D = \{(x_i, t(x_i))\} = \{(1, 0.1), (-1, 0.1), (2, 0.4), (-2, 0.4), (3, 0.9), (-3, 0.9)\}$ be the data set.
[NB: ten times $t(x_i)$ yields the square of $x_i$.] For each extended input
$\hat{x}_i = (x_{i0}, x_{i1}) = (1, x_{i1})$ we have an equation of the form:

$$ \sum_{k=0}^{1} w_k x_{ik} = f^{-1}\{t(x_i)\} + \Big( \frac{df}{ds} \Big)^{-1} \Delta g(x_i) $$

With $f^{-1}(t) = \pm[\ln(1 - t)^{-1}]^{1/2}$ and $df/ds = 2s[1 - f(s)]$ we obtain for each of the
elements $[x_i, t(x_i)] \in D$ a linear equation.
We take first $f^{-1}\{t(x)\} = +[\ln(1 - t)^{-1}]^{1/2}$ and then $-[\ln(1 - t)^{-1}]^{1/2}$, that is, the positive
root for $x_1$, $x_3$, $x_5$ and the negative root for $x_2$, $x_4$, $x_6$:

$$ w_0 + w_1 = 0.32 + 1.71\, \Delta g(x_1) $$
$$ w_0 - w_1 = -0.32 - 1.71\, \Delta g(x_2) $$
$$ w_0 + 2w_1 = 0.71 + 1.17\, \Delta g(x_3) $$
$$ w_0 - 2w_1 = -0.71 - 1.17\, \Delta g(x_4) $$
$$ w_0 + 3w_1 = 1.52 + 3.30\, \Delta g(x_5) $$
$$ w_0 - 3w_1 = -1.52 - 3.30\, \Delta g(x_6) $$

For each weight $w_j$ we obtain an additional equation:

$$ \frac{\partial E}{\partial w_j} = \sum_{x_i \in D} \Delta g(x_i)\, \frac{df}{ds}\, x_{ij} = 0 $$

Thus:

$$ 0.58\, \Delta g(x_1) - 0.58\, \Delta g(x_2) + 0.85\, \Delta g(x_3) - 0.85\, \Delta g(x_4) + 0.3\, \Delta g(x_5) - 0.3\, \Delta g(x_6) = 0 $$
$$ 0.58\, \Delta g(x_1) + 0.58\, \Delta g(x_2) + 1.70\, \Delta g(x_3) + 1.70\, \Delta g(x_4) + 0.9\, \Delta g(x_5) + 0.9\, \Delta g(x_6) = 0 $$

The solution for this set of equations is:

$$ w_0 = 0 \qquad w_1 = 0.39 $$
$$ \Delta g(x_1) = 0.042 \qquad \Delta g(x_2) = 0.042 \qquad \Delta g(x_3) = 0.041 $$
$$ \Delta g(x_4) = 0.041 \qquad \Delta g(x_5) = -0.10 \qquad \Delta g(x_6) = -0.10 $$

If we take first $f^{-1}\{t(x)\} = -[\ln(1 - t)^{-1}]^{1/2}$, we obtain $w_0 = 0$ and $w_1 = -0.39$ with
the same values for $\Delta g(x_i)$.
The MSE as a function of $w_0$ and $w_1$ is given in Figure 3.16.

Figure 3.16 The MSE for Example 3.7
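
The linearized system of Example 3.7 can be solved in the same spirit as in the linear case. Writing $z_i = f^{-1}\{t(x_i)\}$ and $d_i = df/ds$ evaluated at $z_i$, the equations give $\Delta g(x_i) = d_i\,(w_0 + w_1 x_i - z_i)$, and the two conditions $\sum_i \Delta g(x_i)\, d_i\, x_{ij} = 0$ become a 2-by-2 linear system in $w_0$ and $w_1$. The sketch below is an illustration with hypothetical names; it works with unrounded coefficients, so its results can differ somewhat from the rounded values printed in the example.

```python
import math

DATA = [(1, 0.1), (-1, 0.1), (2, 0.4), (-2, 0.4), (3, 0.9), (-3, 0.9)]

def f_inv(t, sign):
    """Inverse of f(s) = 1 - exp(-s^2); `sign` selects the positive or negative root."""
    return sign * math.sqrt(math.log(1.0 / (1.0 - t)))

def df_ds(s):
    """Derivative of f: 2 s exp(-s^2), which equals 2 s [1 - f(s)]."""
    return 2.0 * s * math.exp(-s * s)

# Positive root for positive inputs, negative root for negative inputs.
rows = []
for x, t in DATA:
    z = f_inv(t, 1.0 if x > 0 else -1.0)
    rows.append((x, z, df_ds(z)))

# Conditions sum_i d_i^2 (w0 + w1 x_i - z_i) x_ij = 0 for j = 0, 1 (x_i0 = 1).
a00 = sum(d * d for x, z, d in rows)
a01 = sum(d * d * x for x, z, d in rows)
a11 = sum(d * d * x * x for x, z, d in rows)
b0 = sum(d * d * z for x, z, d in rows)
b1 = sum(d * d * x * z for x, z, d in rows)

det = a00 * a11 - a01 * a01          # Cramer's rule for the 2-by-2 system
w0 = (b0 * a11 - a01 * b1) / det
w1 = (a00 * b1 - a01 * b0) / det
dg = [d * (w0 + w1 * x - z) for x, z, d in rows]
print(w0, w1, dg)
```

Choosing the opposite assignment of roots only flips the sign of every $z_i$, which flips the sign of $w_1$ and leaves the values of $\Delta g(x_i)$ unchanged.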


From the discussion above we can conclude the following:


Practical statement 3.4


If a data set $D$ can be approximated well enough by a function $g_w$ realized by a
single-neuron Perceptron, then we do not have to learn that function because we
can calculate the weights beforehand.


3.6 Generalizing with a single-neuron continuous Perceptron 


In the literature on artificial neural networks one frequently encounters statements 
about the generalization capability of neural networks, such as: ‘Neural networks are 
capable of adapting themselves with the aid of a learning rule and a set of examples 
to model relationships among the data without any a priori assumptions about the 
nature of the relationships.’ A similar statement is: ‘After learning, neural networks 
may be used to predict characteristics of new samples or to derive empirical models 
from examples in situations in which no theoretically based model is known.’ Although 
to a certain extent these types of statements are true, one must be careful with the 
substatements that no model, or no a priori information about the nature of the 
relationship between examples is assumed. 

Generalization is the process of inductive inference of general relationships from 
a finite number of samples. For example, one is inclined to assume that the sun will 
rise tomorrow because we have observed a finite number of times, without any 
counterexample, that the sun has so far risen on every day in our lifetime. Another 
example, already discussed in the preface of the book, is the inference of a new number 
in a finite sequence of numbers: one is inclined to say that the next example in 
sequence 1, 4, 9, 16,... will be the number 25, because we observe a simple regularity 
in the sequence: the $k$th element in the sequence is $k^2$. However, without being
prejudiced in favour of some type of ‘model’ any number may follow the given
sequence. For example, one might as well say that the next number is 29 because
one is in favour of the regularity where the $n$th number $y(n)$ in the sequence is defined
by $y(n)$ = ‘sum of first $n$ primes’. If a large number of people (with some knowledge
of elementary calculus!) are asked to guess the next number in the sequence given
above, almost all will say that the next number is 25. This phenomenon reveals the
human attitude to select always the most ‘simple’ model to explain a sequence of
experimental observations.


Practical statement 3.5


‘Modelling’ can be defined as the process of formulating a finite set of interrelated
rules (or the construction of a finite set of interconnected mechanisms) by which one
can generate the (potentially infinite) set of observed data. The simplicity of the model
is very subjective because it depends on the domain of knowledge of the person who is
doing the modelling.


If one is using complex neural networks to ‘model’ relationships behind a given 
set of data, it is hard to demonstrate that one is using (or assuming) a priori information 
about the kind of relationship. In the case of a single-neuron Perceptron one can 
easily show that one needs (or is assuming) a priori information about the relationship 
between the given data in order to be justified in accepting the outcome of the learning 
process. 

If one is using, for example, a sigmoid transfer function $g_w(x)=\{1+\exp[-s(x)]\}^{-1}$
with $s(x)=w_0+w_1x$ and a single-neuron Perceptron with one input, then whatever the
data set may be, the types of relationships behind the data that one will find with the
neural network will be restricted to a function from the set of sigmoid functions. All
functions realized will be increasing ($w_1>0$) or decreasing ($w_1<0$) sigmoid functions
of the input $x$ (see Figure 3.17). One extreme for $w_1=0$ will be the constant
$g_w(x)=[1+\exp(-w_0)]^{-1}$ and the other extreme for $w_1\to\infty$ with $w_0/w_1$ finite
will be the threshold function $y=1$ for $x>-w_0/w_1$ and $y=0$ for $x<-w_0/w_1$
(see Figure 3.18).

If, for example, the data set contains pairs $[x_i,t(x_i)]$ from the set
$\{(x_i,y_i)\mid y_i=0.01x_i^2\}$ uniformly distributed over a symmetrical $x$-domain, then
the neural net will find after learning that the relation between $x_i$ and $y_i$ is some
constant $g_w(x_i)=[1+\exp(-w_0)]^{-1}$ (see Example 3.5).
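
The symmetric case can be checked numerically. The following is a minimal sketch, not the procedure used in the book: it assumes plain batch gradient descent on the MSE, nine sample points $x=-4,\ldots,4$, zero initial weights, and an illustrative learning rate and step count. The name `fit_single_neuron` is hypothetical.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def fit_single_neuron(xs, ts, lr=0.5, steps=40000):
    """Batch gradient descent on the MSE of g_w(x) = sigmoid(w0 + w1*x)."""
    w0, w1 = 0.0, 0.0  # assumed starting point
    n = len(xs)
    for _ in range(steps):
        grad0 = grad1 = 0.0
        for x, t in zip(xs, ts):
            g = sigmoid(w0 + w1 * x)
            delta = (t - g) * g * (1.0 - g)  # (t - g) * df/ds
            grad0 += delta
            grad1 += delta * x
        w0 += lr * grad0 / n
        w1 += lr * grad1 / n
    return w0, w1

# Samples of y = 0.01 x^2 on a domain that is symmetric around x = 0.
xs = [float(x) for x in range(-4, 5)]
ts = [0.01 * x * x for x in xs]

w0, w1 = fit_single_neuron(xs, ts)
constant = sigmoid(w0)  # the constant output realized when w1 = 0
print(w0, w1, constant)
```

Under these assumptions the weight $w_1$ stays at zero and the output settles at the mean of the targets, so the parabola is ‘explained’ by a constant: the class of realizable functions, not the data, decides the form of the result.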

Figure 3.17 Decreasing and increasing sigmoid functions

Figure 3.18 The extremes of the sigmoid functions

Furthermore, the outcome of the neural net depends on the type of transfer function
used, and the result will also depend on the data set. If, for example, the data again
contains pairs $[x_i,t(x_i)]$ from the set $\{(x_i,y_i)\mid y_i=0.01x_i^2\}$ but now uniformly
distributed over the $x$-interval $[0,x_{\max}]$, then the neural net will find after learning
that the relation between $x_i$ and $y_i$ can be approximated by an increasing sigmoid
function $g_w(x_i)=\{1+\exp[-(w_0+w_1x_i)]\}^{-1}$ with $w_0<0$ and $w_1>0$. For example, if
$x_i\in\{0,1,2,3,4,8\}$ we will find after learning with the single-neuron Perceptron:
$g_w(x_i)=\{1+\exp[-(-4.04+0.58x_i)]\}^{-1}$. From our discussion we may conclude the
following statement:


Practical statement 3.6 


Using a neural network to find the relation behind the data in the data set, one is 
assuming a priori that the relation can be modelled by a function of the class of 
functions realizable with the neural net, and secondly one is assuming that the data 
set is representative of the relation. 


In general, the sigmoid function $g_w(x)=\{1+\exp[-s(x)]\}^{-1}$ with $s(x)=\sum_i w_ix_i$ is used as
the transfer function of a neuron in a continuous multi-layer Perceptron. Sometimes
other transfer functions are used, such as the following:


The hyperbolic tangent $y=\tanh(s(x))$.
The piecewise linear function.
The radial basis function $y=\exp[-(s(x)-\mu)^2/\sigma^2]$.


The requirements for a transfer function are:


1. The function is continuously differentiable (because the adaptation $\Delta w$ is proportional
to $df/ds$).

2. The derivative $df/ds$ is finite, preferably going to zero for large $|s|$. (The adaptation
of weights $\Delta w$ must be small but is proportional to $df/ds$.)

3. The function is non-linear (see Theorem 3.1).
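
The second requirement can be made concrete with a few numbers. The sketch below is an illustration only: it evaluates the derivatives of the sigmoid and the hyperbolic tangent at some arbitrarily chosen values of $s$, using the standard identities $df/ds=f(1-f)$ for the sigmoid and $1-\tanh^2(s)$ for the hyperbolic tangent.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def sigmoid_derivative(s):
    # df/ds = f(s) * [1 - f(s)]
    f = sigmoid(s)
    return f * (1.0 - f)

def tanh_derivative(s):
    # d tanh(s)/ds = 1 - tanh(s)^2
    return 1.0 - math.tanh(s) ** 2

# Both derivatives are finite everywhere and shrink towards zero for large |s|.
for s in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(s, sigmoid_derivative(s), tanh_derivative(s))
```

Both derivatives peak at $s=0$ (0.25 for the sigmoid, 1 for the hyperbolic tangent) and decay quickly, so inputs that are far from the transition region produce only small weight adaptations.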


3.7 The classification of data with a single-neuron Perceptron 


Perceptrons are frequently used for classification of data into two or more
disjoint classes. As an example we refer to the classification of mushrooms into two
separate classes discussed in Section 3.1. We now consider again the case of a
two-class classification problem. It turns out that in many situations we can obtain very
good classification results for two-class problems with just one single-neuron
Perceptron. In that situation one has two sets A and B of objects. Objects are represented
by the values of $n$ measurements of $n$ different attributes of the objects:
$x=(x_1,x_2,\ldots,x_n)$. The classifier realizes a so-called discriminator function $D(x)$ which
assigns to the measurement vector $x$ the label $l_A$ of class A or the label $l_B$ of class B.
In general a penalty is associated with a misclassification. Let $c(l_A,B)$ be the cost of
assigning label $l_A$ to the measurement vector $x$ if the corresponding object is an
element of class B, and $c(l_B,A)$ the cost for the complementary misclassification. If
$p(A)$ and $p(B)$ are the probabilities of respectively class A and class B, and $p(l_A|B)$ and
$p(l_B|A)$ the conditional probabilities of misclassification, then the expected value of
the cost of a classification, the risk, will be:


$$R=p(A)\,p(l_B|A)\,c(l_B,A)+p(B)\,p(l_A|B)\,c(l_A,B)$$

with $p(A)+p(B)=1$.


For an optimal classifier we want to have the risk as small as possible. 

In general we only have a finite set $D_A$ of examples of class A and a finite set $D_B$
of examples of class B. How do we have to design the classifier if we only have these
two finite sets of examples? Given
the sets of examples it is likely that the optimal classifier will divide the input space
$X$ into two regions $X_A$ and $X_B$ such that most examples of $D_A$ are in region $X_A$ and
most of the examples of $D_B$ are in $X_B$. One might, for instance, assume that an
$(n-1)$-dimensional hyperplane will divide the $n$-dimensional input space $X=R^n$ into
regions $X_A$ and $X_B$. Figure 3.19 shows the two-dimensional case with:


$D_A=\{(-3,3),(1,3),(-3,7),(-7,3),(-3,-1),(0,0),(-1,1)\}$
$D_B=\{(2,-2),(2,2),(6,-2),(2,-6),(-2,-2),(0,0),(-1,1)\}$


Figure 3.19 A two-dimensional classification problem

One might just as well assume, however, that the input space can be divided into
two regions by means of several hyperplanes (see, for example, Figure 3.20 for the
same sets $D_A$ and $D_B$). One might even assume that the elements in data set $D_A$ are
the only elements of class A, and class B consists of all remaining elements of the
input space $X=R^n$.

Figure 3.20 Classification with four hyperplanes

Without any additional information the least-biased assumption is that any vector
$x\in R^n$ may, according to some probability ratio, belong to class A as well as to class B.
Accepting this assumption, one has to have recourse to the assumption that given
a measurement vector $x$, there is a conditional probability $p(A|x)$ that $x$
belongs to an object of class A and a conditional probability $p(B|x)=1-p(A|x)$ that
$x$ belongs to an object of class B.

The discrimination function of the optimal classifier must be:


$D(x)=l_A$ if $p(A|x)/p(B|x)>T$
$D(x)=l_B$ if $p(A|x)/p(B|x)<T$


For a minimal risk the threshold $T$ will depend on the costs $c(l_A,B)$ and $c(l_B,A)$ of
misclassification.
For the continuous case we have:

$$\frac{p(A|x)}{p(B|x)}=\frac{f_A(x)\,p(A)}{f_B(x)\,p(B)}\quad\text{(Bayes's rule)}$$

with $f_A(x)$ and $f_B(x)$ the class-conditional probability density functions.
With the use of this expression for the a posteriori probabilities, we can rewrite
the discriminator function as follows:


$D(x)=l_A$ if $f_A(x)/f_B(x)>T\cdot p(B)/p(A)$
$D(x)=l_B$ if $f_A(x)/f_B(x)<T\cdot p(B)/p(A)$


The quotient $f_A(x)/f_B(x)$ is called the likelihood ratio $L$ for class A. The quantity
$K=T\cdot p(B)/p(A)$ is called the likelihood ratio threshold. In cases where the costs $c(l_A,B)$
and $c(l_B,A)$ are equal, we have to select the threshold $T$ (and thus $K$) such that the risk $R$:

$$R=p(A)\,p(l_B|A)+p(B)\,p(l_A|B)=\int_{X_B}p(A)f_A(x)\,dx+\int_{X_A}p(B)f_B(x)\,dx$$

is minimal. The risk for a one-dimensional input space with $p(A)=p(B)$ is given by
the shaded area in Figure 3.21.

Figure 3.21 The risk (shaded area) for a one-dimensional classification problem

In order to design an optimal ‘Bayes’ classifier, one has to assume (or to know) the
class-conditional probability density functions $f_A(x)$ and $f_B(x)$, and in addition
the class probabilities $p(A)$ and $p(B)$. The first condition for the derivation of the
optimal ‘Bayes’ classifier implies that one has to assume the functional form of the
density functions and to estimate the parameters (mean and variance) of the density
function from the sets of examples $D_A$ and $D_B$.
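
To make these assumptions tangible, the sketch below implements the likelihood-ratio rule for a one-dimensional input. Everything specific in it is an assumption made for illustration: the Gaussian form of the densities, their means and standard deviations, the equal class probabilities, and the threshold $T=1$ (the usual choice when both misclassification costs are equal). The helper names `gaussian_pdf` and `bayes_label` are hypothetical.

```python
import math

def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def bayes_label(x, f_a, f_b, p_a, p_b, threshold=1.0):
    """Return 'A' if f_A(x)/f_B(x) > K with K = T*p(B)/p(A), else 'B'."""
    k = threshold * p_b / p_a  # likelihood ratio threshold K
    # Compare f_A(x) with K*f_B(x) to avoid dividing by a tiny f_B(x).
    return "A" if f_a(x) > k * f_b(x) else "B"

# Hypothetical class-conditional densities (assumed, not taken from the text).
f_a = lambda x: gaussian_pdf(x, 1.0, 1.0)
f_b = lambda x: gaussian_pdf(x, -1.0, 1.0)

labels = [bayes_label(x, f_a, f_b, 0.5, 0.5) for x in (-2.0, -0.5, 0.5, 2.0)]
print(labels)
```

Note how much has to be supplied before a single label can be computed: the functional form of both densities, their parameters and both class probabilities.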

In the following sections we will see that if we use a continuous Perceptron as a
classifier we do not have to make such severe assumptions about the functional form
of the density functions, and also that we do not have to make estimations of the
parameters of the density functions. We will also show in the subsequent sections
that under certain conditions a single-neuron Perceptron will be an optimal classifier.

A single-neuron Perceptron assigns to every measurement vector $x$ an output value
$g_w(x)$ depending on the weight vector $w$. If we use the sigmoid transfer function
$f(s)=[1+\exp(-s)]^{-1}$ with $s=\sum_i w_ix_i$ then the output $g_w(x)$ will vary between 0 and 1.

We can use the output $g_w(x)$ as a label for the class to which $x$ belongs. If we use
the sigmoid function as the transfer function of the neuron, then there are three
different ways of learning to use the output $g_w(x)$ as a label for class A and class B:


1. Hyperplane boundary classification by learning one and zero labelling. In this case
the $n$-dimensional input space $X=R^n$ is divided by an $(n-1)$-dimensional hyperplane
defined by $\sum_i w_ix_i=0$ into the regions $X_A$ and $X_B$. For $x\in X_A$ we have $g_w(x)>0.5$
and for $x\in X_B$ we have $g_w(x)<0.5$. In the learning phase for $x\in D_A$ the target value
is $t(x)=1$ and for $x\in D_B$ the target value is $t(x)=0$. So we have for the discriminator
function $D(x)=l_A$ if $g_w(x)>0.5$ and $D(x)=l_B$ if $g_w(x)<0.5$, if $p(A)=p(B)$.

2. Hyperplane boundary classification by learning double threshold labelling. In this
case too the $n$-dimensional input space $X=R^n$ is divided by an $(n-1)$-dimensional
hyperplane defined by $\sum_i w_ix_i=0$ into $X_A$ and $X_B$. For $x\in X_A$ we have $g_w(x)>0.5$
and for $x\in X_B$ we have $g_w(x)<0.5$. In the learning phase, however, for an input
$x\in D_A$ the target value is $t(x)\ge a$ with $0.5<a<1$ (e.g. $a=0.9$) and for an input $x\in D_B$
the target value is $t(x)\le b$ with $0<b<0.5$ (e.g. $b=0.1$), given $p(A)=p(B)$.

3. Hyperplane boundary classification by single threshold labelling. Again the
$n$-dimensional input space $X=R^n$ is divided by an $(n-1)$-dimensional hyperplane
defined by $\sum_i w_ix_i=0$ into $X_A$ and $X_B$. For $x\in X_A$ we have $g_w(x)>0.5$ and for $x\in X_B$
we have $g_w(x)<0.5$. In the learning phase, however, for an input $x\in D_A$ the target
value is $t(x)\ge 0.5$, and for an input $x\in D_B$ the target value is $t(x)\le 0.5$. We will see that
in the last case we have to modify the learning rule as discussed in previous sections.


In the next sections we will discuss the different classification methods.


3.8 Hyperplane boundary classification by one-zero labelling


In the case of hyperplane boundary classification by one-zero labelling with a
single-neuron Perceptron with sigmoid transfer function, the $n$-dimensional input
space $X=R^n$ is divided by a hyperplane defined by the dot product $\hat{w}\cdot\hat{x}=0$ into the
regions $X_A$ and $X_B$. In the learning phase for $x\in D_A$ the target value is $t(x)=1$, and for
$x\in D_B$ the target value is $t(x)=0$. After learning, we use $g_w(x)$ as a label for the class to
be identified: $x\in A$ if $g_w(x)>0.5$ and $x\in B$ if $g_w(x)<0.5$, if $p(A)=p(B)$.

We observe that during learning we have to learn for each $x$ the target value $t(x)$,
thus the learning goal is identical with the approximate fitting of a data set $D=D_A\cup D_B$
as discussed in Section 3.5. Therefore we can use the same learning rule as discussed
in the previous sections.

If the data set is linearly separable by a hyperplane defined by the dot product
$\hat{w}\cdot\hat{x}=0$, then the final weight vector will be such that the function $g_w(x)$ becomes
(almost) a threshold function and the final MSE becomes zero. Figure 3.22 shows
the one-dimensional case. The ‘hyperplane’ $w_0+w_1x_1=0$ will be in this case a point
$x_1=-w_0/w_1$.

One may note that if the data set is linearly separable we also have the same problem
as discussed in Section 2.4 on classification with a single-neuron binary Perceptron.
The desired output is also zero or one, and only the input vectors are now real valued
instead of binary valued, but the convergence theorem is independent of the type of
input vector. Thus to solve the given classification problem we can also use the
reinforcement learning rule of Section 2.4.

We say that the data set $D$ is separable if the data sets $D_A$ and $D_B$ do not have
input vectors in common. Most data sets are separable but are not linearly separable,
i.e. we cannot divide the input space $X$ by one single hyperplane into regions $X_A$
and $X_B$ such that $D_A\subseteq X_A$ and $D_B\subseteq X_B$. Although most data sets are not linearly
separable we can frequently obtain very good classification results with one
single-neuron continuous Perceptron.

Figure 3.22 The sigmoid function becoming a threshold function


Example 3.8 


Assume we have a two-dimensional data set $D=D_A\cup D_B$ depicted by the points in
Figure 3.23.
The data set $D_A$ is generated by a Gaussian distribution function with the following
probability density function:

$$f_A(x,y)=\frac{1}{\sqrt{2\pi}\,\sigma_x}\exp\left[\frac{-(x-\mu_x)^2}{2\sigma_x^2}\right]\cdot\frac{1}{\sqrt{2\pi}\,\sigma_y}\exp\left[\frac{-(y-\mu_y)^2}{2\sigma_y^2}\right]$$

with $\mu_x=0$, $\sigma_x=0.2$, $\mu_y=0.465$ and $\sigma_y=0.4$. Similarly we have the data set $D_B$ generated
by the same type of Gaussian distribution function but now with $\mu_x=0$, $\sigma_x=0.4$,
$\mu_y=-0.465$ and $\sigma_y=0.2$.

Figure 3.23 Two-dimensional classification problem with Gaussian distributions

For an optimal minimal risk classifier one can prove that the boundary between
the two regions $X_A$ and $X_B$ is defined by the condition that $f_A(x,y)=f_B(x,y)$, if
$p(A)=p(B)$ and $c(l_A,B)=c(l_B,A)$. The optimal classification boundary, or discrimination
line, is given in Figure 3.23 by a curved line. One can calculate that the probability
of error in that case is 5.14 per cent.

We can also divide the input space by one straight line (the boundary hyperplane
is in this case a straight line given in Figure 3.23); the probability of error will then
be slightly larger. The optimal position of the line is $y=-0.11$. It turns out that the
probability of error in that case is 5.15 per cent. If a single-neuron Perceptron could
find this boundary we would have a very good result. In Section 3.11 we will show that
with a single-neuron Perceptron we can learn to find this boundary.


The preceding example illustrates the following practical statement: 


Practical statement 3.7 


Many two-class classification problems (in which the optimal classification boundary 
is an open, non-linear, convex boundary and the intersection area of both classes is 
not too large) can be solved reasonably well with one single-neuron continuous 
Perceptron. 


Classifiers that divide the $n$-dimensional input space by an $(n-1)$-dimensional
hyperplane will be called hyperplane boundary classifiers. The single-neuron
Perceptron is an optimal hyperplane boundary classifier if a certain condition is
satisfied:


Theorem 3.6


If the data sets $D_A$ and $D_B$ are representative for the underlying distribution functions
of class A and class B, then the single-neuron Perceptron will divide after learning
with one-zero labelling the $n$-dimensional input space $X=R^n$ by an optimally located
$(n-1)$-dimensional hyperplane if for the final weight vector: $|w|\to\infty$.


Proof 


In the previous section we found that for an optimal classifier the risk (for equal
costs of misclassification):

$$R=p(A)\,p(l_B|A)+p(B)\,p(l_A|B)=\int_{X_B}p(A)f_A(x)\,dx+\int_{X_A}p(B)f_B(x)\,dx$$

must be minimal. For the one-dimensional case with $p(A)=p(B)$ the shaded area in
Figure 3.24 must be minimal for a minimal risk.

Figure 3.24 The risk (shaded area) for a one-dimensional classification problem

By training a single-neuron Perceptron to fit the data set $D=D_A\cup D_B$ by one-zero
labelling, the MSE will be minimized. We have to prove that by minimizing the MSE,
the risk will also be minimized.

In Section 3.2 we have defined the MSE as:

$$E=\frac{1}{N}\sum_{x_i\in D}n(x_i)[t(x_i)-g_w(x_i)]^2$$

with $N$ the number of samples in the data set $D$, and $n(x_i)$ the number of times input
$x_i$ occurs in the data set.

For input vectors $x_i\in D_A$ the target is $t(x_i)=1$, and for input vectors of $D_B$ the target
is $t(x_i)=0$. Thus:

$$E=\frac{1}{N}\sum_{x_i\in D_A}n_A(x_i)[1-g_w(x_i)]^2+\frac{1}{N}\sum_{x_i\in D_B}n_B(x_i)[0-g_w(x_i)]^2$$

with $N$ the number of samples in the data set $D=D_A\cup D_B$ and $n_A(x_i)$ (respectively $n_B(x_i)$)
the number of times input $x_i$ occurs in $D_A$ (respectively $D_B$). We can rewrite the above
equation as:

$$E=\frac{N_A}{N}\sum_{x_i\in D_A}\frac{n_A(x_i)}{N_A}[1-g_w(x_i)]^2+\frac{N_B}{N}\sum_{x_i\in D_B}\frac{n_B(x_i)}{N_B}[g_w(x_i)]^2$$

with $N_A$ (respectively $N_B$) the total number of samples in $D_A$ (respectively $D_B$).

If the data set is representative of the underlying distributions, then $N_A/N$ is an
estimate of the class probability $p(A)$ and $n_A(x_i)/N_A$ is an estimate of the probability
$f_A(x_i)\,dx$ of finding an input vector in the infinitesimal $n$-dimensional cube $dx$
surrounding $x_i$. Thus the discrete sum above is a discrete approximation of:

$$E=\int_X p(A)f_A(x)[1-g_w(x)]^2\,dx+\int_X p(B)f_B(x)[g_w(x)]^2\,dx$$

After learning, the input space $X=R^n$ will be divided by the hyperplane defined
by $\hat{w}\cdot\hat{x}=0$ into the regions $X_A$ and $X_B$. Thus we can replace the equation above by:

$$E=\int_{X_A}p(A)f_A(x)[1-g_w(x)]^2\,dx+\int_{X_B}p(A)f_A(x)[1-g_w(x)]^2\,dx+\int_{X_A}p(B)f_B(x)[g_w(x)]^2\,dx+\int_{X_B}p(B)f_B(x)[g_w(x)]^2\,dx$$

If, after learning, $|w|$ is large ($|w|\to\infty$), then $g_w(x)$ will become a threshold function
with $g_w(x)\approx 1$ if $x\in X_A$ and $g_w(x)\approx 0$ if $x\in X_B$. The equation for the MSE reduces to:

$$E=\int_{X_B}p(A)f_A(x)\,dx+\int_{X_A}p(B)f_B(x)\,dx$$

This expression is identical to the expression for the risk where there is an equal cost
of misclassification. We know that the gradient descent learning rule will minimize
the MSE and thus it will also minimize the risk.
One easily verifies that if $g_w(x)=p(A)f_A(x)/[p(A)f_A(x)+p(B)f_B(x)]$ then the MSE will
be minimal. QED
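
The closing remark of the proof can be checked by minimizing the integrand of the continuous MSE pointwise. The short derivation below is a sketch of that check; the auxiliary symbol $h$ is introduced here for convenience and does not appear in the text.

```latex
% For a fixed x, write g = g_w(x) and consider the integrand of E:
h(g) = p(A) f_A(x)\,(1-g)^2 + p(B) f_B(x)\,g^2 .
% Setting the derivative with respect to g to zero:
\frac{dh}{dg} = -2\,p(A) f_A(x)\,(1-g) + 2\,p(B) f_B(x)\,g = 0
\quad\Longrightarrow\quad
g = \frac{p(A) f_A(x)}{p(A) f_A(x) + p(B) f_B(x)} .
% The second derivative is positive, so this is a minimum:
\frac{d^2h}{dg^2} = 2\,[\,p(A) f_A(x) + p(B) f_B(x)\,] > 0 .
```

By Bayes's rule the minimizing value equals the a posteriori probability $p(A|x)$, so an output that minimizes the MSE at every $x$ exceeds 0.5 exactly where class A is the more probable class.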


To understand what is happening during learning, we will analyze some simple
classification situations.

Figure 3.25 The sigmoid transfer function $f(s)$

According to Theorem 3.3 the adaptation of the weight vector $w$ must be:

$$\Delta\hat{w}=\varepsilon\sum_{x_i\in D}[t(x_i)-g_w(x_i)]\frac{df}{ds}\hat{x}_i$$

with $g_w(x_i)=f[s(x_i)]$, $\hat{w}=[w_0,w_1,\ldots,w_n]^T$ the extended weight vector and
$\hat{x}_i=[1,x_{i1},x_{i2},\ldots,x_{in}]^T$ the extended input vector. (Note that we have to multiply the
equation above by $n(x_i)/N$ in the most complete form.)

The contribution of different input vectors to the adaptation of the weight vector
will be different. For an element of $D_A$ the target value is 1. The value of $t(x_i)-f[s(x_i)]$
is large for large negative values of $s(x_i)$ and will become zero for large positive values
of the weighted input (see Figure 3.25). The factor $df/ds=f(s)[1-f(s)]$ has a
maximum of 0.25 for $s=0$ (see Figure 3.26). The product:

$$\gamma(x_i)=\{t(x_i)-f[s(x_i)]\}\frac{df}{ds}$$

will be called the internal learning rate for $x_i$.
For $t(x_i)=1$ the internal learning rate as a function of the weighted input $s(x_i)$ will be:

$$\gamma(s)=[1-f(s)]\,f(s)\,[1-f(s)]$$

and is given in Figure 3.27 by curve A.
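
The shape of this curve is easy to tabulate. The sketch below evaluates the internal learning rate directly from its definition for a target of 1; evaluating the same expression with a target of 0 gives the corresponding rate for an element of $D_B$. The function name `internal_learning_rate` and the sample values of $s$ are illustrative choices.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def internal_learning_rate(s, target):
    # gamma = [t - f(s)] * df/ds, with df/ds = f(s) * [1 - f(s)]
    f = sigmoid(s)
    return (target - f) * f * (1.0 - f)

# Columns: s, gamma for target 1, gamma for target 0.
for s in (-6.0, -2.0, 0.0, 2.0, 6.0):
    print(s, internal_learning_rate(s, 1.0), internal_learning_rate(s, 0.0))
```

For a target of 1 the rate is 0.125 at $s=0$, about 0.0125 at $s=2$ and practically zero at $s=6$: inputs that are already mapped close to their target barely move the weights.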
For $t(x_i)=0$ the internal learning rate is:

$$\gamma(s)=-f(s)\,f(s)\,[1-f(s)]$$

represented by curve B in Figure 3.27.

The learning rule can now be written as follows:

$$\Delta\hat{w}=\sum_{x_i\in D}\varepsilon\,\gamma(x_i)\,\hat{x}_i$$

Consider a Perceptron with two external inputs and two data points $x_a\in D_A$
and $x_b\in D_B$ for which the distance from $x_a$ to the initial separating hyperplane
is the same as the distance from $x_b$ to the hyperplane. At first glance one might
assume that learning will not change the weight vector because the initial hyperplane
separates in an ideal way the input space, and the two inputs will thus be classified
correctly [$g_w(x_a)>0.5$ and $g_w(x_b)<0.5$]. However, at the first learning step the targets
$t(x_a)=1$ and $t(x_b)=0$ might not be satisfied and the MSE will not be zero. During
learning, the weights will be adapted such that the MSE decreases while keeping the
classification correct.

Because $x_a$ and $x_b$ are on the opposite sides of the hyperplane we obtain for the
weighted inputs: $s(x_a)>0$ and $s(x_b)<0$. Now we recall from Section 2.2 that the weight
vector $w=[w_1,w_2]^T$ is perpendicular to the separating plane and is pointing into the
direction of the region $X_A$ where $s(x)>0$. From Section 2.2 we recall the relation
between the distance $\delta(x)$ (in the direction of $w$) from the hyperplane to a point $x$,
and the weighted input $s(x)$:

$$\delta(x)=\frac{w_0+w\cdot x}{|w|}=\frac{s(x)}{|w|}$$

Because $\delta(x_a)=-\delta(x_b)$ we have $s(x_a)=-s(x_b)$ and thus $\gamma(x_a)=-\gamma(x_b)$ (see Figure 3.28).
The adaptation of the weight vector:

$$\Delta w=\varepsilon\gamma(x_a)x_a+\varepsilon\gamma(x_b)x_b=\varepsilon\gamma(x_a)(x_a-x_b)$$

is in the direction of $x_a-x_b$ and thus in the direction of the weight vector $w$. Thus
the weight vector is multiplied by some scalar. Because the weight vector $w$ before
and after adaptation is in the same direction, the orientation of the separating
hyperplane will also be the same.

We recall from Section 2.2 that the distance along $w$ from the origin to the
hyperplane is given by:

$$d=\frac{-w_0}{|w|}$$

Because $\gamma(x_a)=-\gamma(x_b)$ we have $\Delta w_0=\varepsilon\gamma(x_a)+\varepsilon\gamma(x_b)=0$. Thus before and after
adaptation $w_0$ will be the same. We found that $|w|$ will be different before and
after adaptation and thus the distance $d$ from the origin to the hyperplane will change.
The hyperplane thus comes closer to one of the two points.

As we continue learning we observe that the hyperplane twists around the initial
hyperplane but still between $x_a$ and $x_b$ (if $\varepsilon$ is small enough) while $|w|$ increases, i.e.
until $g_w(x)$ becomes a threshold function and the separating hyperplane will be the
same as the initial one but now with zero MSE.


Example 3.9 


Consider the simple one-dimensional classification problem: $D_A=\{x_a\}=\{3\}$ and
$D_B=\{x_b\}=\{1\}$. The initial extended weight vector is $\hat{w}=[w_0,w_1]^T=[-4,2]^T$ (see
Figure 3.29).

Figure 3.29 The initial realized sigmoid function for Example 3.9


The initial separating hyperplane (a point) is defined by $-4+2x_1=0$. For the
weighted inputs we find $s(x_a)=-4+2\cdot 3=2$, $s(x_b)=-4+2\cdot 1=-2$. From Figure 3.27
we obtain for the internal learning rates $\gamma(x_a)=0.01$ and $\gamma(x_b)=-0.01$. We find for
the value of the adaptation of the weight vector:

$$\begin{bmatrix}\Delta w_0\\ \Delta w_1\end{bmatrix}=\varepsilon\gamma(x_a)\begin{bmatrix}x_{a0}\\ x_{a1}\end{bmatrix}+\varepsilon\gamma(x_b)\begin{bmatrix}x_{b0}\\ x_{b1}\end{bmatrix}=\varepsilon(0.01)\begin{bmatrix}1\\ 3\end{bmatrix}+\varepsilon(-0.01)\begin{bmatrix}1\\ 1\end{bmatrix}=\varepsilon\begin{bmatrix}0\\ 0.02\end{bmatrix}$$

With $\varepsilon=10$ the new weight vector becomes:

$$\begin{bmatrix}w_0\\ w_1\end{bmatrix}=\begin{bmatrix}-4\\ 2.2\end{bmatrix}$$

The separating point $-w_0/w_1\approx 1.8$ is moving to the left and $w_1$ is increased. An
additional adaptation will show that the next separating point will be located to the
right of the original separating point $x=2$. Finally we will end up with $|\hat{w}|=\infty$ and
$-w_0/w_1=2$. During learning we will jump in the gutter of the error landscape (see
Figure 3.30) from one slope to the other until $|w_0|$ and $|w_1|$ approach infinity and the
MSE becomes zero.
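
The first adaptation step of this example can be reproduced in a few lines. This is a sketch under one stated difference: the internal learning rates are computed from $\gamma=(t-f)f(1-f)$ instead of being read from a figure, so they come out as roughly $\pm 0.0125$ rather than the rounded $\pm 0.01$, and the resulting numbers differ slightly from those quoted in the example. The helper name `one_zero_step` is hypothetical.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def one_zero_step(w, samples, eps):
    """One batch update of w = [w0, w1]; samples holds (x, target) pairs."""
    dw0 = dw1 = 0.0
    for x, t in samples:
        f = sigmoid(w[0] + w[1] * x)
        gamma = (t - f) * f * (1.0 - f)  # internal learning rate
        dw0 += eps * gamma               # extended input component is 1
        dw1 += eps * gamma * x
    return [w[0] + dw0, w[1] + dw1]

w = [-4.0, 2.0]
samples = [(3.0, 1.0), (1.0, 0.0)]       # x_a = 3 in D_A, x_b = 1 in D_B
w_new = one_zero_step(w, samples, eps=10.0)
boundary = -w_new[0] / w_new[1]          # separating point after the step
print(w_new, boundary)
```

The two contributions to $w_0$ cancel, $w_1$ grows, and the separating point moves to the left of $x=2$, in line with the qualitative description of the example.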
Figure 3.30 The error function of Example 3.9

If for $x_a\in D_A$ and $x_b\in D_B$ we have $x_a=x_b=x$, the contribution of those points to
the adaptation will be:

$$\Delta\hat{w}=\varepsilon\gamma(x_a)\hat{x}_a+\varepsilon\gamma(x_b)\hat{x}_b=\varepsilon[\gamma(x_a)+\gamma(x_b)]\hat{x}$$

The value of $\gamma(x_a)+\gamma(x_b)$ is given by curve ‘A and B’ in Figure 3.26. We see from
Figure 3.26 that if the two points coincide on the actual separating hyperplane, then
they do not contribute to the adaptation.

If the two coinciding points are on the positive side of the hyperplane
($s(x_a)=s(x_b)>0$), they will be treated as a point of $D_B$. If the two coinciding points
are on the negative side of the hyperplane ($s(x_a)=s(x_b)<0$), they will be treated as a
point of $D_A$.

We conclude that the contribution of a data point $x_i$ to the adaptation of the
weight vector depends on the internal learning rate $\gamma(x_i)$, and the internal learning
rate depends on the weighted input $s(x_i)$, while $s(x_i)=\delta|w|$ with $\delta$ the distance from
the hyperplane to the point $x_i$. From Figure 3.27 we conclude that the internal
learning rate for $x_i$ is relatively small for $|s(x_i)|>5$. This implies that data points with
an actual output $g_w(x_i)\approx 1$ or $g_w(x_i)\approx 0$ hardly contribute to the adaptation, while
on the other hand they consume learning time. In the next section we will give a
method that eliminates those points automatically from the learning set.


3.9 Hyperplane boundary classification by double threshold 
labelling 


In the case of hyperplane boundary classification by double threshold labelling with
a single-neuron Perceptron with sigmoid transfer function, the $n$-dimensional input
space $X=R^n$ is also divided by an $(n-1)$-dimensional hyperplane, defined by the dot
product $\hat{w}\cdot\hat{x}=0$, into the regions $X_A$ and $X_B$. For $x\in X_A$ we require $g_w(x)\ge 0.5$ and
for $x\in X_B$, $g_w(x)<0.5$ if $p(A)=p(B)$.

In the learning phase, however, we try to label each element of $D_A$ with a target
value $t(x_i)\ge a$ with $0.5<a<1$ (e.g. $a=0.9$) and we try to label each element of $D_B$
with a target value $t(x_i)\le b$ with $0<b<0.5$ (e.g. $b=0.1$). Thus if during learning
$g_w(x_i)\ge a$ for $x_i\in D_A$, then the squared error is zero, and if $g_w(x_i)<a$, then the squared
error is $[a-g_w(x_i)]^2$.

If $x_i\in D_B$ we have for the target value $t(x_i)\le b$, thus if during the learning phase
$g_w(x_i)\le b$, then the squared error is zero, and if $g_w(x_i)>b$, then the squared error is
$[b-g_w(x_i)]^2$.

This learning rule implies that there might be elements of $D$ for which we do not
have to adapt the weights because the error is zero. This is of great advantage for
the speed of the learning process (especially for networks with many neurons) because
in general the data set will be large and the adaptation of weights is time consuming
if every element of the data set $D$ is contributing to the adaptation of weights, as
with the classification procedure of the previous section. In cases where the data set
is linearly separable (see Figure 3.31 for the one-dimensional case), classification with
the double threshold boundary with a final MSE=0 is optimal.

Classification by learning with double threshold labelling will not always be optimal
(but will still be very good) if the data set is not linearly separable because in that case
the final weights will not become very large and thus $g_w(x)$ will not become a threshold
function as required for optimal classification (see previous section).

These observations and Practical Statement 3.5 lead to the following practical
statement.


Figure 3.31 Classification of linearly separable one-dimensional data


Practical statement 3.8


Many two-class classification problems can be learned quickly and be solved
reasonably well (if not always optimally) with one single-neuron continuous
Perceptron if we use the classification method of double threshold labelling.


With the method of classification of double threshold labelling, only a subset
$D_{Aw}\subseteq D_A$ and a subset $D_{Bw}\subseteq D_B$ (depending on the actual weight vector $w$) of the
given data sets $D_A\subseteq X_A$ and $D_B\subseteq X_B$ are wrongly classified, and only these inputs
will contribute to the adaptation of weights.

Given some weight vector $w$ we will minimize the MSE $E$ for elements in the set
$D_{Aw}\cup D_{Bw}$:

$$E=\frac{1}{N}\left\{\sum_{x_i\in D_{Aw}}n(x_i)[a-g_w(x_i)]^2+\sum_{x_i\in D_{Bw}}n(x_i)[b-g_w(x_i)]^2\right\}$$

with $N$ the number of elements in $D_{Aw}\cup D_{Bw}$ and $n(x_i)$ the number of times the
elements $x_i$ occur in $D_{Aw}$ respectively in $D_{Bw}$. Because $N$ is a constant and assuming
$n(x_i)=1$ or $n(x_i)=0$, we can simplify the expression to:

$$E=\sum_{x_i\in D_{Aw}}[a-g_w(x_i)]^2+\sum_{x_i\in D_{Bw}}[b-g_w(x_i)]^2$$

To minimize $E$, according to Theorem 3.3, the adaptation of the weight vector $w$
must be:

$$\Delta\hat{w}=\varepsilon\sum_{x_i\in D_{Aw}}[a-g_w(x_i)]\frac{df}{ds}\hat{x}_i+\varepsilon\sum_{x_i\in D_{Bw}}[b-g_w(x_i)]\frac{df}{ds}\hat{x}_i$$

with $g_w(x_i)=f(s(x_i))$, $\hat{w}=[w_0,w_1,\ldots,w_n]^T$ the extended weight vector and
$\hat{x}_i=[1,x_{i1},x_{i2},\ldots,x_{in}]^T$ the extended input vector.

The given learning rule is correct as long as the set $D_{Aw}\cup D_{Bw}$ is constant. After
adaptation of the weight vector $w$ the set $D_{Aw}\cup D_{Bw}$ may be changed into a new set
$D_{Aw'}\cup D_{Bw'}$ because $w$ is changed into $w'$. Thus we subsequently minimize the MSE
for a sequence of wrongly classified sets. While in the case of classification with
one-zero labelling the MSE will gradually decrease during learning (if the external
learning rate $\varepsilon$ is small enough), it may happen that with this classification method
jumps in the decreasing curve of the MSE occur during learning, because the sets of
wrongly classified examples may change abruptly after the adaptation of the weight
vector $w$.

The product:

$$\gamma(x_i)=[t(x_i)-f(s(x_i))]\frac{df}{ds}$$

is called the internal learning rate for $x_i$. The contribution of different elements of
the data set $D$ to the adaptation of the weight vector can be quite different, as
expressed by the internal learning rate. For a wrongly classified element of $D_A$ the
target value is, for example, 0.9. Because $df/ds=f(s)[1-f(s)]$ the internal learning
rate as a function of the weighted input $s(x_i)$ for the elements of $D_{Aw}$ will be:

$$\gamma(s)=[0.9-f(s)]\,f(s)\,[1-f(s)]$$

and is given in Figure 3.32 by curve A. The curve is only used for values of $s$ to the
left of the solid vertical line because at the right side $f(s)>0.9$ and no adaptation
will occur.

Figure 3.32 Internal learning rate for $t(x_i)=0.9$ (curve A) and $t(x_i)=0.1$ (curve B)

For a wrongly classified element of $D_B$ the target value is, for example, 0.1. The
internal learning rate is:

$$\gamma(s)=[0.1-f(s)]\,f(s)\,[1-f(s)]$$

represented by curve B in Figure 3.32. The curve is only used for values of $s$ to the
right of the solid vertical line because at the left side $f(s)<0.1$ and no adaptation
will occur.

The learning rule can now be written as:

$$\Delta\hat{w}=\sum_{x_i\in D_{Aw}}\varepsilon\gamma(x_i)\hat{x}_i+\sum_{x_i\in D_{Bw}}\varepsilon\gamma(x_i)\hat{x}_i$$

The first sum represents a weighted sum of wrongly labelled vectors from $D_{Aw}$. The
second sum represents a weighted sum of wrongly labelled vectors from $D_{Bw}$. Let us
denote the first sum by $\hat{k}_A$ and the second sum by $\hat{k}_B$; then we can write the adaptation
as follows:

$$\Delta\hat{w}=\hat{k}_A+\hat{k}_B$$

If it happens that $\hat{k}_A+\hat{k}_B=\lambda\hat{w}$, then after adaptation the new value of the weight
vector becomes $\hat{w}+\Delta\hat{w}=(1+\lambda)\hat{w}$ and the separating hyperplane, defined by $s(x)=0$,
will be the same before and after adaptation but the MSE will be reduced.


Example 3.9 


Let $D_A=\{x_1,x_2,x_3\}$ with $x_1=(1,2)$, $x_2=(-2,-1)$ and $x_3=(-1,2)$. Let
$D_B=\{x_4,x_5,x_6\}$ with $x_4=(2,1)$, $x_5=(-1,-2)$ and $x_6=(2,-2)$. Given is $D_A\subseteq X_A$ and
$D_B\subseteq X_B$.

With a single-neuron Perceptron we want to separate the two-dimensional input
space with a one-dimensional hyperplane defined by $s(x)=w_0+w_1x_1+w_2x_2=0$ (a
line) into the regions $X_A$ and $X_B$ such that $g_w(x)>0.5$ for $x\in X_A$ and $g_w(x)<0.5$ for $x\in X_B$.

Assume the initial weights are $w_0=0$, $w_1=-1$ and $w_2=1$. The initial separating
line defined by $s(x)=-x_1+x_2=0$ is given by the solid line in Figure 3.33. Although
all points of $D_A$ and $D_B$ are already correctly classified, the output targets for learning
are not satisfied.

Figure 3.33 Two-dimensional classification problem of Example 3.9 before adaptation

We calculate the weighted inputs as follows:

$s(x_1)=(-1)(1)+(1)(2)=1$
$s(x_2)=(-1)(-2)+(1)(-1)=1$
$s(x_3)=(-1)(-1)+(1)(2)=3$
$s(x_4)=(-1)(2)+(1)(1)=-1$
$s(x_5)=(-1)(-1)+(1)(-2)=-1$
$s(x_6)=(-1)(2)+(1)(-2)=-4$


With the help of Figure 3.25 we find:


$g_w(x_1)=0.73$
$g_w(x_2)=0.73$
$g_w(x_3)=0.95$
$g_w(x_4)=0.27$
$g_w(x_5)=0.27$
$g_w(x_6)=0.02$


We see that only for $x_3$ and for $x_6$ are the learning targets satisfied. We thus have
to adapt the weights. From Figure 3.32 we obtain:


$\gamma(x_1)=\gamma(x_2)=0.03$
i(X4) = 7(X5) = — 0.03 
With the learning rule we obtain: 
AW = €5(X 1 )% + E7(X 2) 2 + p(X 4)Kq + EP(X5)X5 


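With $\varepsilon = 20$ we obtain:

$$\Delta\hat{w} = 0.6\begin{pmatrix} 1 \\ 1 \\ 2 \end{pmatrix} + 0.6\begin{pmatrix} 1 \\ -2 \\ -1 \end{pmatrix} - 0.6\begin{pmatrix} 1 \\ 2 \\ 1 \end{pmatrix} - 0.6\begin{pmatrix} 1 \\ -1 \\ -2 \end{pmatrix} = \begin{pmatrix} 0 \\ -1.2 \\ 1.2 \end{pmatrix}$$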


The new weight vector becomes:

$$\hat{w} = \begin{pmatrix} 0 \\ -2.2 \\ 2.2 \end{pmatrix}$$
This new weight vector defines the same separating line $s(x) = -2.2x_1 + 2.2x_2 = 0$ but
now the target values are satisfied. If we compare Figure 3.33 (before adaptation)
with Figure 3.34 (after adaptation) we observe that the lines with constant values of
$s$ are moved towards the separating line. For $s = 2.2$ we have $f(s) = 0.9$ and for $s = -2.2$,
$f(s) = 0.1$: the realized function becomes almost a threshold function.

Figure 3.34 Two-dimensional classification problem of Example 3.9 after
adaptation
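The observation that this adaptation leaves the separating line in place can be checked numerically. The sketch below is only an illustration: it assumes double threshold targets of 0.9 for $D_A$ and 0.1 for $D_B$, and it assumes that $\gamma(x)$ equals $(t - g)\,df/ds$ for inputs whose target is not yet satisfied and zero otherwise. The function names are hypothetical.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

# Data of Example 3.9. The targets 0.9 / 0.1 are an assumption of this sketch.
D_A = [(1, 2), (-2, -1), (-1, 2)]
D_B = [(2, 1), (-1, -2), (2, -2)]
w = [0.0, -1.0, 1.0]   # extended weight vector [w0, w1, w2]

def gamma(x, target, is_class_a, w):
    """Assumed form (t - g) * df/ds; zero once the target is satisfied."""
    s = w[0] + w[1] * x[0] + w[2] * x[1]
    g = sigmoid(s)
    satisfied = g >= target if is_class_a else g <= target
    return 0.0 if satisfied else (target - g) * g * (1.0 - g)

def adaptation(w, eps):
    """Sum of eps * gamma(x) * extended input vector over both classes."""
    dw = [0.0, 0.0, 0.0]
    for points, target, is_a in ((D_A, 0.9, True), (D_B, 0.1, False)):
        for x in points:
            c = eps * gamma(x, target, is_a, w)
            for k, xk in enumerate((1.0, x[0], x[1])):
                dw[k] += c * xk
    return dw

dw = adaptation(w, eps=1.0)
print(dw)  # first component ~0, second and third equal in size with opposite sign
```

Because the adaptation is a positive multiple of the current weight vector $(0, -1, 1)$, any positive learning rate only lengthens the weight vector and the line $-x_1 + x_2 = 0$ stays where it is.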


3.10 Hyperplane boundary classification by single threshold 
labelling 


In the previous section on classification with one-zero labelling as well as with double
threshold labelling we used the output of a single-neuron Perceptron to classify the
inputs: if $g_w(x) > 0.5$ then $x \in X_A$, and if $g_w(x) < 0.5$ then $x \in X_B$. However, during learning
with the given learning sets the targets were not equal to $t(x) > 0.5$ for $x \in D_A \subseteq X_A$
and $t(x) < 0.5$ for $x \in D_B \subseteq X_B$. The question arises as to why we did not use the same
conditions for learning as for classification. The learning rule is implied by our goal
to minimize the MSE:


$$E = \sum_{x \in D} [t(x) - g_w(x)]^2$$

With the learning rule:

$$\Delta\hat{w} = \varepsilon \sum_{x \in D} [t(x) - g_w(x)] \frac{df}{ds}\,\hat{x}$$

the weights are modified such that the MSE becomes as small as possible. Suppose
we require during learning that $t(x) > 0.5$ for $x \in D_A$ and $t(x) < 0.5$ for $x \in D_B$. If at some
stage during learning the output $g_w(x)$ would become equal to 0.5 for all $x \in D_A \cup D_B$,
then the MSE would be zero and our goal would be reached. An output $g_w(x) = 0.5$
for all $x \in D_A \cup D_B$ can be realized if all weights are zero because then $s(x) = 0$. Thus
during learning the weights will become zero and the only global minimum of $E = 0$
is reached. In that case all elements of $D_A$ are misclassified and all elements of $D_B$
are correctly classified. Thus we cannot use the single threshold classification method
without additional precautionary measures.

We must realize that the learning rule will always minimize the MSE but that this
criterion is not always identical with minimizing the number of wrongly classified
elements of the data set $D$.
From this analysis we can conclude as follows: 


Practical statement 3.9 


A small value of the MSE does not always imply that the classification with a
single-neuron Perceptron is correct.
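A small numerical illustration of this statement follows. It is a sketch under stated assumptions: the error is taken as the sum of squared differences to the target 0.5, counted only for inputs on the wrong side of 0.5, and the data of Example 3.9 are reused purely for convenience. The function names and the weight vector `bad` are hypothetical.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def single_threshold_error(w, class_a, class_b):
    """Squared error to the target 0.5, counted only for misclassified inputs."""
    total = 0.0
    for points, wants_high in ((class_a, True), (class_b, False)):
        for x1, x2 in points:
            g = sigmoid(w[0] + w[1] * x1 + w[2] * x2)
            wrong = g < 0.5 if wants_high else g > 0.5
            if wrong:
                total += (0.5 - g) ** 2
    return total

def recognised_as_a(w, points):
    """Number of inputs with output strictly above 0.5."""
    return sum(1 for x1, x2 in points
               if sigmoid(w[0] + w[1] * x1 + w[2] * x2) > 0.5)

D_A = [(1, 2), (-2, -1), (-1, 2)]   # data of Example 3.9, reused for illustration
D_B = [(2, 1), (-1, -2), (2, -2)]

zero = [0.0, 0.0, 0.0]              # all weights zero: every output is exactly 0.5
bad = [0.0, 0.001, -0.001]          # tiny weights pointing the wrong way

print(single_threshold_error(zero, D_A, D_B), recognised_as_a(zero, D_A))
print(single_threshold_error(bad, D_A, D_B), recognised_as_a(bad, D_A),
      recognised_as_a(bad, D_B))
```

With all weights zero the error vanishes although no element of $D_A$ produces an output above 0.5. With the tiny weight vector `bad` the error is of the order of $10^{-6}$ while every one of the six inputs lies on the wrong side.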


Although classification of data with the unmodified single threshold labelling is
dangerous because of the small values of the final weights, we can still get reasonable
results because the final separating hyperplane depends only on the quotients of the (small)
values of the weights. We will give an illustration in the following example.


Example 3.10 


Assume we are given a two-dimensional data set $D = D_A \cup D_B$ depicted in Figure 3.35,
with

$$D_A = \{(-3, 3), (1, 3), (-3, 7), (-7, 3), (-3, -1), (0, 0), (-1, 1)\}$$

$$D_B = \{(2, -2), (2, 2), (6, -2), (2, -6), (-2, -2), (0, 0), (-1, 1)\}$$

Figure 3.35 Two-dimensional classification problem of Example 3.10 with
final desired separating hyperplane $S_f$

We have the target values: $t(x_i) > 0.5$ if $x_i \in D_A$ and $t(x_i) < 0.5$ if $x_i \in D_B$. We take a
single-neuron Perceptron with the sigmoidal transfer function:

$$f(s) = \frac{1}{1 + e^{-s}}$$

Let:

$$g_w(x) = f[s(x)]$$

then

$$\frac{df}{ds} = f(s)[1 - f(s)]$$

In Table 3.1 and Table 3.2 we have calculated for the sets $D_A$ and $D_B$ the values
of $s(x_i)$, $g_w(x_i)$, $t(x_i) - g_w(x_i)$ and $df/ds$ for initial weights $w_0 = 0$, $w_1 = 0.1$ and $w_2 = 0.2$
corresponding to the initial separating line $S_0$ given in Figure 3.36. Table 3.1 shows
values for $D_A$ with $t(x_i) > 0.5$, and Table 3.2 for $D_B$ with $t(x_i) < 0.5$.

Table 3.1

x_i         s = w0 + w1 x1 + w2 x2     g(x_i)    t(x_i) - g(x_i)    df/ds
(-3, 3)     -0.3 + 0.6 =  0.3          0.57       0.00              0.25
(1, 3)       0.1 + 0.6 =  0.7          0.66       0.00              0.23
(-3, 7)     -0.3 + 1.4 =  1.1          0.75       0.00              0.19
(-7, 3)     -0.7 + 0.6 = -0.1          0.48       0.02              0.25
(-3, -1)    -0.3 - 0.2 = -0.5          0.38       0.12              0.24
(0, 0)                    0.0          0.50       0.00              0.25
(-1, 1)     -0.1 + 0.2 =  0.1          0.52       0.00              0.25

Table 3.2

x_i         s = w0 + w1 x1 + w2 x2     g(x_i)    t(x_i) - g(x_i)    df/ds
(2, -2)      0.2 - 0.4 = -0.2          0.45       0.00              0.25
(2, 2)       0.2 + 0.4 =  0.6          0.64      -0.14              0.23
(6, -2)      0.6 - 0.4 =  0.2          0.55      -0.05              0.25
(2, -6)      0.2 - 1.2 = -1.0          0.27       0.00              0.20
(-2, -2)    -0.2 - 0.4 = -0.6          0.35       0.00              0.23
(0, 0)                    0.0          0.50       0.00              0.25
(-1, 1)     -0.1 + 0.2 =  0.1          0.52      -0.02              0.25

Figure 3.36 Two-dimensional classification problem of Example 3.10 with
initial $S_0$ and final desired separating hyperplane $S_f$

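The misclassified extended input vectors from $D_A$ are $D_{Aw} = \{[1, -7, 3], [1, -3, -1]\}$.
The misclassified extended vectors from $D_B$ are $D_{Bw} = \{[1, 2, 2], [1, 6, -2], [1, -1, 1]\}$
(see the shaded area in Figure 3.36).

The adaptation of the weight vector must be:

$$\Delta\hat{w} = \varepsilon \sum_{\hat{x}_i \in D_{Aw}} \{0.5 - g_w(x_i)\} \frac{df}{ds}\,\hat{x}_i + \varepsilon \sum_{\hat{x}_i \in D_{Bw}} \{0.5 - g_w(x_i)\} \frac{df}{ds}\,\hat{x}_i$$

Thus:

$$\Delta\hat{w} = \varepsilon \cdot 0.02 \cdot 0.25 \begin{pmatrix} 1 \\ -7 \\ 3 \end{pmatrix} + \varepsilon \cdot 0.12 \cdot 0.24 \begin{pmatrix} 1 \\ -3 \\ -1 \end{pmatrix} - \varepsilon \cdot 0.14 \cdot 0.23 \begin{pmatrix} 1 \\ 2 \\ 2 \end{pmatrix} - \varepsilon \cdot 0.05 \cdot 0.25 \begin{pmatrix} 1 \\ 6 \\ -2 \end{pmatrix} - \varepsilon \cdot 0.02 \cdot 0.25 \begin{pmatrix} 1 \\ -1 \\ 1 \end{pmatrix}$$

With $\varepsilon = 0.5$ we obtain:

$$\Delta\hat{w} = \begin{pmatrix} -0.01 \\ -0.13 \\ -0.03 \end{pmatrix}$$

Thus we obtain for the weight vector after adaptation:

$$\hat{w}_1 = \begin{pmatrix} 0.0 \\ 0.1 \\ 0.2 \end{pmatrix} + \begin{pmatrix} -0.01 \\ -0.13 \\ -0.03 \end{pmatrix} = \begin{pmatrix} -0.01 \\ -0.03 \\ 0.17 \end{pmatrix}$$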


Figure 3.37 Two-dimensional classification problem of Example 3.10 with
initial separating hyperplane $S_0$, the separating hyperplane after
one learning step $S_1$ and the final desired separating hyperplane
$S_f$


The equation for the separating line (see line $S_1$ in Figure 3.37) will be
$-0.01 - 0.03x_1 + 0.17x_2 = 0$. For simulation results see Figure 3.38 (global learning)
and Figure 3.39 (local learning).
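The single global learning step of this example can be reproduced with a few lines of code. The sketch below assumes the rule used above (error $0.5 - g_w(x_i)$ for misclassified inputs, zero otherwise) and works with unrounded sigmoid values instead of the rounded table entries. The function name `global_step` is hypothetical.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

D_A = [(-3, 3), (1, 3), (-3, 7), (-7, 3), (-3, -1), (0, 0), (-1, 1)]
D_B = [(2, -2), (2, 2), (6, -2), (2, -6), (-2, -2), (0, 0), (-1, 1)]

def global_step(w, eps):
    """One global (batch) learning step with single threshold labelling."""
    dw = [0.0, 0.0, 0.0]
    for points, wants_high in ((D_A, True), (D_B, False)):
        for x1, x2 in points:
            g = sigmoid(w[0] + w[1] * x1 + w[2] * x2)
            misclassified = g < 0.5 if wants_high else g > 0.5
            if not misclassified:
                continue                      # no contribution to the adaptation
            factor = eps * (0.5 - g) * g * (1.0 - g)
            for k, xk in enumerate((1.0, x1, x2)):
                dw[k] += factor * xk
    return [wk + dk for wk, dk in zip(w, dw)]

w1 = global_step([0.0, 0.1, 0.2], eps=0.5)
print([round(v, 2) for v in w1])  # [-0.01, -0.03, 0.17]
```

Calling `global_step` repeatedly on its own output gives the sequence of weight vectors of global learning.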


In Figure 3.39 the results for local learning of a simulation experiment for the
data given at the beginning of this example are given. After the presentation of a
single example of the data set the weights are adapted. An element of the data set
is called example $k$ if that element is the $k$th element in the enumeration of the tables
given before. We see that given the initial separating line, the first three examples
are classified correctly and thus will not contribute to the adaptation of weights at
that stage of learning.

One way to circumvent the problem of bad classification with the single threshold
labelling method is using a target value for $x \in D_A$ slightly higher than 0.5 (e.g. 0.51)
and a target value for $x \in D_B$ slightly smaller than 0.5 (e.g. 0.49). However, we are then
in fact using the double threshold method as described in the previous section.

Another way to prevent the weights becoming zero and still use the target $t(x) > 0.5$
for an element of $X_A$ and $t(x) < 0.5$ for $x \in X_B$ is to multiply after each adaptation step
the weights with a scalar such that $|\hat{w}|$ remains constant. Multiplying all weights
with the same scalar does not change the separating hyperplane defined by the dot
product $\hat{w} \cdot \hat{x} = 0$. By keeping $|\hat{w}|$ constant during learning we are searching for a
minimum of the MSE for values of $\hat{w}$ on a hypersphere in the solution space $W$. We
know, however, that the value of $|\hat{w}|$ must become infinite to reach an optimal solution
(see Section 3.7). If we divide the learning process in a sequence of time intervals such
that $|\hat{w}|$ is kept constant in each interval until a minimum for the MSE is reached
and then change $|\hat{w}|$ subsequently to a larger constant value for the next sequence,
we systematically search through the complete solution space $W$ and end up with the
required large value of $|\hat{w}|$.

Figure 3.38 Simulation result after several epochs (= number of times the
total training set of fourteen examples is supplied) of global
learning [panel header: weights are changed after all 14 examples; recoverable panel titles: Epoch 20, Epoch 100]
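The rescaling step can be sketched as follows. The helper names `rescale` and `separating_point`, the weight values and the list of lengths are illustrative assumptions; in a learning loop `rescale` would be called after every adaptation step, and the length would be raised from one time interval to the next.

```python
import math

def rescale(w, norm):
    """Scale w so that its Euclidean length equals norm (w must be non-zero)."""
    length = math.sqrt(sum(v * v for v in w))
    return [v * norm / length for v in w]

def separating_point(w):
    """Zero of s(x) = w0 + w1 * x for a neuron with a single input."""
    return -w[0] / w[1]

w = [0.3, -0.4]                      # hypothetical weights after some adaptation step
for norm in (1.0, 2.0, 5.0):         # lengths used in successive time intervals
    scaled = rescale(w, norm)
    print(norm, scaled, separating_point(scaled))
```

The printed separating point is the same for every length, which is the property that makes the rescaling harmless for the classification itself.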


Example 3.11 


Suppose we have one neuron with one input $x_1$ and we have the following data sets
$D_A = \{-2.0, -1.9, -1.8, -1.7, 1.95\}$ and $D_B = \{0.5, 0.6, 0.7, 0.8\}$. Note that the two
sets are not linearly separable. After learning with single threshold labelling we want
to have for the output of the neuron $g_w(x) > 0.5$ for $x \in D_A$ and $g_w(x) < 0.5$ for $x \in D_B$.
An optimal solution for the separating hyperplane $w_0 + w_1 x_1 = 0$ (a point in this case)
is obtained when $-w_0/w_1$ has a value between $-1.7$ and 0.5; in that case eight
elements will be classified correctly and one element is misclassified.

Figure 3.39 Simulation result after several epochs (= number of times
fourteen randomly chosen examples of the total training set are
supplied) of local learning [recoverable panel titles: Epoch 1, example 4; Epoch 1, example 5; Epoch 2, example 14; Epoch 10, example 14; Epoch 20, example 14; Epoch 100, example 14]

If we use the unmodified single threshold labelling method, the weights will indeed
go to zero but the separation point $-w_0/w_1$ goes to a constant value. There are two
separation points to which the network converges; which one is reached depends on
the initial weights. The values of $-w_0/w_1$ will be either 1.06 or $-2.67$. In both cases
five inputs are misclassified.

In order to see why we obtain these solutions we calculate the error measure $E$
with $w$ on a circle around the origin in the $w_0$-$w_1$ plane for a small value of $|w|$. We
take $w_0 = |w| \cos\phi$ and $w_1 = |w| \sin\phi$ and vary $\phi$ from 0 to $2\pi$ rad. In Figure 3.40
we see that for $|w| = 1$ we obtain two minima for the MSE: one at $\phi = 5.53$ rad
(corresponding to the separating point $-w_0/w_1 = -|w| \cos\phi / (|w| \sin\phi) = -0.073/(-0.068) = 1.06$),
and a second at $\phi = 3.50$ rad (corresponding to a separating point
$-w_0/w_1 = 0.094/(-0.035) = -2.67$).

When we use constant $|w|$ during subsequent time intervals of learning as described
above and start the first learning interval with $|w| = 1$, we will find one of the solutions
mentioned before. If we use $|w| = 2$ in the second learning interval, then we will again
find two possible solutions: one larger than $-2.67$ and the other smaller than 1.06.
Which solution is found depends on the solution found in the previous time interval.

Figure 3.40 The MSE for Example 3.11 with $|w| = 1$, $w_0 = |w| \cos\phi$,
$w_1 = |w| \sin\phi$ and $0 < \phi < 2\pi$.

Figure 3.41 The MSE for Example 3.11 with respectively $|w| = 1$, $|w| = 2$,
$|w| = 5$ and $|w| = 100$, with $w_0 = |w| \cos\phi$ and $w_1 = |w| \sin\phi$


This means that both solutions move in the direction of the correct interval between
$-1.7$ and 0.5. This phenomenon becomes clear from Figure 3.41 where we plotted
the MSE for several values of $|w|$. When $|w|$ becomes infinite we see that the minimum
of the MSE will be in the interval for $\phi$ between 3.67 and 5.18 rad, corresponding
to the separating points $-1.7$ and 0.5. A closer look reveals that there will be one
minimum just to the left of $\phi = 5.18$ rad (corresponding to the point 0.5).
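The misclassification counts and the angles quoted in this example can be checked with a short script. This is a sketch: the function names are hypothetical, and an input is counted as class A when its weighted input $s$ is positive.

```python
import math

D_A = [-2.0, -1.9, -1.8, -1.7, 1.95]   # output should exceed 0.5
D_B = [0.5, 0.6, 0.7, 0.8]             # output should stay below 0.5

def misclassified(w0, w1):
    """Count inputs on the wrong side of the separating point (s > 0 means class A)."""
    wrong = sum(1 for x in D_A if w0 + w1 * x <= 0)
    wrong += sum(1 for x in D_B if w0 + w1 * x >= 0)
    return wrong

def on_circle(phi, radius=1.0):
    """Weights (w0, w1) on a circle of the given radius around the origin."""
    return radius * math.cos(phi), radius * math.sin(phi)

def angle_for(point):
    """Angle phi in (pi, 2*pi) whose separating point -w0/w1 equals `point` (w1 < 0)."""
    return math.atan2(-1.0, point) % (2.0 * math.pi)

for phi in (5.53, 3.50, 4.5):
    w0, w1 = on_circle(phi)
    print(phi, round(-w0 / w1, 2), misclassified(w0, w1))

print(round(angle_for(-1.7), 2), round(angle_for(0.5), 2))  # 3.67 5.18
```

The two minima at $\phi = 5.53$ and $\phi = 3.50$ rad each leave five inputs misclassified, whereas any angle strictly between the two printed limits (4.5 rad is used as a sample) leaves only the input 1.95 on the wrong side.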


3.11 Application to the classification of normally distributed 
classes 


In Section 3.8 we presented a classification problem with a two-dimensional data set
$D = D_A \cup D_B$ depicted by the points in Figure 3.42. The data set $D_A$ is generated by
a Gaussian distribution function with probability density function:

$$f_A(x, y) = \frac{1}{\sqrt{2\pi}\,\sigma_x} \exp\left[\frac{-(x - \mu_x)^2}{2\sigma_x^2}\right] \cdot \frac{1}{\sqrt{2\pi}\,\sigma_y} \exp\left[\frac{-(y - \mu_y)^2}{2\sigma_y^2}\right]$$

with $\mu_x = 0$, $\sigma_x = 0.2$, $\mu_y = 0.465$ and $\sigma_y = 0.4$.

Similarly we have the data set $D_B$ generated by the same type of Gaussian
distribution function but now with $\mu_x = 0$, $\sigma_x = 0.4$, $\mu_y = -0.465$ and $\sigma_y = 0.2$.

Figure 3.42 Two-dimensional classification problem with two Gaussian
distributed data sets

Although the optimal classification boundary, or discrimination line, is a curved 
line, with an error of 5.14 per cent, we can divide the input space by one 
straight line such that the probability of error will then be slightly larger: 5.15 per 
cent. The optimal position of the line is y= — 0.11. If we use a single-neuron Perceptron 
and use the one-zero labelling method with 100000 examples generated by the 
distribution we will find an almost horizontal separation hyperplane located at 
y=—0.11 with an error of 5.15 per cent. 








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When we use the method of double threshold labelling with targets 0.9 and 0.1 we 
will again find an almost horizontal separating hyperplane located at $y = -0.11$ with
an error of 5.18 per cent. A typical example of the final weights is $w_0 = -0.7280$,
$w_1 = -0.0971$ and $w_2 = -6.3304$, corresponding with a separating line
$y = -0.015x - 0.115$.
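The step from a weight vector to the equation of the separating line is a one-line computation. The sketch below takes the third weight as $w_2 = -6.3304$ (an assumption about the intended value) and solves $w_0 + w_1 x + w_2 y = 0$ for $y$. The helper name is hypothetical.

```python
def line_from_weights(w0, w1, w2):
    """Return (slope, intercept) of the line w0 + w1*x + w2*y = 0, assuming w2 != 0."""
    return -w1 / w2, -w0 / w2

slope, intercept = line_from_weights(-0.7280, -0.0971, -6.3304)
print(round(slope, 3), round(intercept, 3))  # -0.015 -0.115
```

The slope is so small that the line is almost horizontal, and the intercept gives its position on the $y$-axis.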

When we use the method of double threshold labelling with targets 0.51 and 0.49
we finally again find an almost horizontal separating hyperplane located at $y = -0.15$
with an error of 5.39 per cent.

When we use the unmodified single threshold labelling method the position of the 
final, almost horizontal separating hyperplane will be located at y= —0.176 and we 
obtain an error of 6.66 per cent. As a typical example we will find for the values of the
(small) final weights: $w_0 = 0.0022$, $w_1 = 0.0042$ and $w_2 = 0.0131$.


3.12 Learning rule for a two-layer continuous Perceptron 


We consider first the situation with two neurons in the first layer and one neuron 
in the second layer (see Figure 3.43). The adaptation rule for this simple case is almost 
the same as for a general two-layer network with n, neurons in the first and n, 
neurons in the second layer. We will come back to it later. 

For the output $y_3$ of the neuron in the second layer we have:

$$y_3 = f_3(s_3)$$

with $s_3 = w_{31} y_1 + w_{32} y_2 + w_{30}$, the weighted input of neuron 3.





Figure 3.43 A simple two-layer continuous Perceptron

For the output $y_1$ and $y_2$ we have respectively:

$$y_1 = f_1(s_1) \quad \text{and} \quad y_2 = f_2(s_2)$$

with $s_1 = w_{11} x_1 + w_{12} x_2 + w_{10}$ and $s_2 = w_{21} x_1 + w_{22} x_2 + w_{20}$.
The function $g_w(x) = y_3$ realized by the neural net is:

$$g_w(x) = y_3 = f_3[w_{31} f_1\{w_{11} x_1 + w_{12} x_2 + w_{10}\} + w_{32} f_2\{w_{21} x_1 + w_{22} x_2 + w_{20}\} + w_{30}]$$


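Assume we want to realize with the two-layer Perceptron of Figure 3.43 for input
vector $x_i$ a target value $t_3(x_i)$ for the output of neuron 3 with $x_i = [x_{i1}, x_{i2}] \in U$, with
$U$ a finite subset of $R^2$, i.e. $[x_i, t_3(x_i)]$ is an element of the given data set $D$.

For the error function we have:

$$E = \sum_{x_i \in U} [t_3(x_i) - y_3(x_i)]^2$$

and for the adaptation of a weight $w_{ij}$:

$$\Delta w_{ij} = -\varepsilon \frac{\partial E}{\partial w_{ij}}$$

Thus the adaptation of $w_{30}$ becomes (a constant factor 2 is absorbed in the learning rate $\varepsilon$):

$$\Delta w_{30} = -\varepsilon \frac{\partial E}{\partial w_{30}} = \varepsilon \sum_{x_i \in U} [t_3(x_i) - y_3(x_i)] \frac{df_3}{ds_3}$$

In the same way we obtain for the adaptation of $w_{31}$ and $w_{32}$:

$$\Delta w_{31} = -\varepsilon \frac{\partial E}{\partial w_{31}} = \varepsilon \sum_{x_i \in U} [t_3(x_i) - y_3(x_i)] \frac{df_3}{ds_3} y_1(x_i)$$

$$\Delta w_{32} = -\varepsilon \frac{\partial E}{\partial w_{32}} = \varepsilon \sum_{x_i \in U} [t_3(x_i) - y_3(x_i)] \frac{df_3}{ds_3} y_2(x_i)$$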


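The adaptation of $w_{10}$ will be:

$$\Delta w_{10} = -\varepsilon \frac{\partial E}{\partial w_{10}} = \varepsilon \sum_{x_i \in U} [t_3(x_i) - y_3(x_i)] \frac{df_3}{ds_3} w_{31} \frac{df_1}{ds_1}$$

In the same way we obtain for the adaptation of $w_{11}$ and $w_{12}$:

$$\Delta w_{11} = -\varepsilon \frac{\partial E}{\partial w_{11}} = \varepsilon \sum_{x_i \in U} [t_3(x_i) - y_3(x_i)] \frac{df_3}{ds_3} w_{31} \frac{df_1}{ds_1} x_{i1}$$

$$\Delta w_{12} = -\varepsilon \frac{\partial E}{\partial w_{12}} = \varepsilon \sum_{x_i \in U} [t_3(x_i) - y_3(x_i)] \frac{df_3}{ds_3} w_{31} \frac{df_1}{ds_1} x_{i2}$$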


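The adaptations of $w_{20}$, $w_{21}$ and $w_{22}$ are:

$$\Delta w_{20} = -\varepsilon \frac{\partial E}{\partial w_{20}} = \varepsilon \sum_{x_i \in U} [t_3(x_i) - y_3(x_i)] \frac{df_3}{ds_3} w_{32} \frac{df_2}{ds_2}$$

$$\Delta w_{21} = -\varepsilon \frac{\partial E}{\partial w_{21}} = \varepsilon \sum_{x_i \in U} [t_3(x_i) - y_3(x_i)] \frac{df_3}{ds_3} w_{32} \frac{df_2}{ds_2} x_{i1}$$

$$\Delta w_{22} = -\varepsilon \frac{\partial E}{\partial w_{22}} = \varepsilon \sum_{x_i \in U} [t_3(x_i) - y_3(x_i)] \frac{df_3}{ds_3} w_{32} \frac{df_2}{ds_2} x_{i2}$$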


We can simplify the expressions above and at the same time gain some intuitive
insight into the adaptation rule by introducing the concept of the weighted output error
$\delta_j(x_i)$ for some neuron $j$ for a given input $x_i$.

For the output neuron 3 we define the weighted output error as:

$$\delta_3(x_i) = t_3(x_i) - y_3(x_i)$$

For neuron 1 we define the weighted output error as:

$$\delta_1(x_i) = \delta_3(x_i) \frac{df_3}{ds_3} w_{31}$$

This weighted output error can be considered as the difference between the unknown
(non-constant) target value $t_1(x_i)$ of the output of neuron 1 and its actual output $y_1(x_i)$,
i.e.

$$\delta_1(x_i) = t_1(x_i) - y_1(x_i)$$

For neuron 2 we define the weighted output error as:

$$\delta_2(x_i) = \delta_3(x_i) \frac{df_3}{ds_3} w_{32}$$

This weighted output error of neuron 2 can again be considered as the difference
between the unknown (non-constant) target value $t_2(x_i)$ of the output of neuron 2
and its actual output $y_2(x_i)$, i.e.

$$\delta_2(x_i) = t_2(x_i) - y_2(x_i)$$


We see that the weighted output error for the neurons in the first layer can be
calculated from the error in the output layer (the error is back-propagated) and, given
the output error of some neuron, we can adapt the input weights for any neuron $j$
in the same way because we can rewrite the adaptation rule for the $k$th weight of
neuron $j$ now in a general adaptation rule:

$$\Delta w_{jk} = \varepsilon \sum_{x_i \in U} \delta_j(x_i) \frac{df_j}{ds_j} z_k$$

with $z_k$ the $k$th input of neuron $j$.
The adaptation rule is in accordance with our intuitive ideas about adaptation. If,
for instance, the target output $t_3(x_i)$ of the output neuron 3 is greater than the actual
output $y_3(x_i)$, then one should increase the input weight $w_{31}$ of the output neuron if
the corresponding input $y_1(x_i)$ is positive (which will always be the case for a sigmoid
transfer function), and decrease the weight if $y_1(x_i)$ is negative, and this is what is
prescribed globally by the adaptation rule:

$$\Delta w_{31} = \varepsilon \sum_{x_i \in U} [t_3(x_i) - y_3(x_i)] \frac{df_3}{ds_3} y_1(x_i)$$

because $df/ds$ is always positive (for a sigmoid transfer function). A similar explanation
holds in case that the target is smaller than the actual output. The same
reasoning holds for the adaptation of the weights in the first layer.
For the weighted error assigned to a neuron in the first layer one should intuitively
reason as follows. If the target value $t_3(x_i)$ is greater than the actual output $y_3(x_i)$
and the weight $w_{31}$ is positive, then one should increase the output $y_1(x_i)$ to obtain
a better result, because $y_3(x_i)$ is a monotonically increasing function of $w_{31} y_1(x_i)$.
Thus the error assigned to the output of neuron 1 must be proportional to the
product $\{t_3(x_i) - y_3(x_i)\} w_{31}$; the remaining factor $df_3/ds_3$ does not change its sign,
because $df/ds$ is always positive for a sigmoid transfer
function. This is prescribed by the formula given above for the weighted output error
of the neurons in the first layer.
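The adaptation formulas of this section can be checked against a numerical derivative of the error function. The sketch below is an illustration only: it assumes fixed targets, the error $E = \sum [t_3 - y_3]^2$ and sigmoid transfer functions for all three neurons; the weight values and the two data points are hypothetical. Because the square contributes a factor 2, the back-propagated sums are compared with $-\tfrac{1}{2}\,\partial E / \partial w$.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def forward(w, x):
    """w = [[w10, w11, w12], [w20, w21, w22], [w30, w31, w32]]; returns (y1, y2, y3)."""
    y1 = sigmoid(w[0][0] + w[0][1] * x[0] + w[0][2] * x[1])
    y2 = sigmoid(w[1][0] + w[1][1] * x[0] + w[1][2] * x[1])
    y3 = sigmoid(w[2][0] + w[2][1] * y1 + w[2][2] * y2)
    return y1, y2, y3

def error(w, data):
    return sum((t - forward(w, x)[2]) ** 2 for x, t in data)

def backprop_sums(w, data):
    """The sums of the adaptation rule (without the factor eps) for all nine weights."""
    g = [[0.0] * 3 for _ in range(3)]
    for x, t in data:
        y1, y2, y3 = forward(w, x)
        d3 = (t - y3) * y3 * (1.0 - y3)          # delta_3 * df3/ds3
        d1 = d3 * w[2][1] * y1 * (1.0 - y1)      # delta_1 * df1/ds1
        d2 = d3 * w[2][2] * y2 * (1.0 - y2)      # delta_2 * df2/ds2
        for k, zk in enumerate((1.0, y1, y2)):
            g[2][k] += d3 * zk
        for k, zk in enumerate((1.0, x[0], x[1])):
            g[0][k] += d1 * zk
            g[1][k] += d2 * zk
    return g

def numeric_sums(w, data, h=1e-6):
    """-0.5 * dE/dw by central differences; the 0.5 removes the factor 2 of the square."""
    g = [[0.0] * 3 for _ in range(3)]
    for j in range(3):
        for k in range(3):
            old = w[j][k]
            w[j][k] = old + h
            e_plus = error(w, data)
            w[j][k] = old - h
            e_minus = error(w, data)
            w[j][k] = old
            g[j][k] = -0.5 * (e_plus - e_minus) / (2.0 * h)
    return g

weights = [[0.2, -0.3, 0.5], [-0.1, 0.4, 0.3], [0.1, -0.7, 0.6]]   # hypothetical
data = [((0.5, -1.0), 0.9), ((-1.5, 0.3), 0.1)]                    # hypothetical
analytic = backprop_sums(weights, data)
numeric = numeric_sums(weights, data)
print(max(abs(a - n) for ra, rn in zip(analytic, numeric) for a, n in zip(ra, rn)))
```

The printed maximum difference should be many orders of magnitude smaller than the sums themselves, which indicates that the back-propagated expressions agree with the gradient of $E$.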


Example 3.12 


We consider a classification problem and use the single threshold labelling method. 
(We will see in the next section that the threshold labelling method gives a bad 
performance.) 

The initial weights of the neurons are: 


$$\begin{array}{lll}
w_{10} = 0.0 & w_{11} = 0.1 & w_{12} = 0.1 \\
w_{20} = 0.0 & w_{21} = -0.05 & w_{22} = -0.1 \\
w_{30} = -0.2 & w_{31} = -1 & w_{32} = 1
\end{array}$$

All transfer functions are sigmoid functions: $f(s) = (1 + e^{-s})^{-1}$. Assume the learning
data sets are:

$D_A = \{A_1 = (0, 10), A_2 = (0, -10)\}$ with target $t(x_i) > 0.5$
$D_B = \{B_1 = (10, 0), B_2 = (-10, 0)\}$ with target $t(x_i) < 0.5$

In Figure 3.44 these data points are given together with the separating lines of the
first and second input neurons. Looking in the direction of the arrows of these
separating lines, the output of the neuron will be $y_1 > 0.5$ and $y_2 > 0.5$ to the right of
these lines (the weight vectors are pointing in that direction) and $< 0.5$ on the other
side. It will be clear that with these initial orientations of the separating lines,
discrimination between points of data sets $D_A$ and $D_B$ is impossible.
Figure 3.44 Data points and separating lines ($s = 0$) of the first and second
neuron in the first layer of Example 3.12 [figure label: input space of neurons 1 and 2]


In Table 3.3 the values of $s_1(x_i)$, $s_2(x_i)$, $s_3(x_i)$, $y_1(x_i)$, $y_2(x_i)$, $y_3(x_i)$, $df_1/ds_1$, $df_2/ds_2$,
$df_3/ds_3$ and $t_3(x_i) - y_3(x_i)$ for the inputs of the data set are given.

For each input vector $x_i$ there is a weighted input vector $s_i$ obtained after the
following transformation:

$$\begin{pmatrix} s_{i1} \\ s_{i2} \end{pmatrix} = \begin{pmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \end{pmatrix} \begin{pmatrix} x_{i1} \\ x_{i2} \end{pmatrix} + \begin{pmatrix} w_{10} \\ w_{20} \end{pmatrix}$$

In Figure 3.45 the data points after this transformation (in this case a linear
transformation because $w_{10}$ and $w_{20}$ are zero) are represented in the first-layer weighted
input space $S_1$. The weighted inputs $s_i$ are subsequently transformed by a non-linear
mapping:

$$y_{ij} = f_j(s_{ij})$$

In Figure 3.46 the data points, after this non-linear scalar mapping, are represented
in the first-layer output space $Y_1$. In Figure 3.46 the separating line of the output
neuron is also given.

From Figure 3.46 it will be clear that the transformed data points cannot be
classified correctly by any separating line of the output neuron. The preceding
transformation mentioned above must be changed first by changing the weights $w_{10}$,
$w_{11}$, $w_{12}$, $w_{20}$, $w_{21}$ and $w_{22}$.
Figure 3.45 The first-layer weighted input space $(s_1, s_2)$ of Example 3.12

Figure 3.46 The first-layer output space $(y_1, y_2)$ of Example 3.12

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According to the formulas for adaptation of the weights of neuron 3, the weight 
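vector of neuron 3 becomes (see also Table 3.3):

$$\begin{pmatrix} \Delta w_{30} \\ \Delta w_{31} \\ \Delta w_{32} \end{pmatrix} = \varepsilon(0.16)(0.224)\begin{pmatrix} 1.000 \\ 0.731 \\ 0.269 \end{pmatrix} + \varepsilon(-0.038)(0.249)\begin{pmatrix} 1.000 \\ 0.269 \\ 0.622 \end{pmatrix} = \varepsilon\begin{pmatrix} 0.0264 \\ 0.0237 \\ 0.0038 \end{pmatrix}$$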
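According to the formulas for adaptation of the weights of neuron 1, the weight
vector of neuron 1 becomes:

$$\begin{pmatrix} \Delta w_{10} \\ \Delta w_{11} \\ \Delta w_{12} \end{pmatrix} = \varepsilon(0.16)(0.224)(-1)(0.197)\begin{pmatrix} 1 \\ 0 \\ 10 \end{pmatrix} + \varepsilon(-0.038)(0.249)(-1)(0.197)\begin{pmatrix} 1 \\ -10 \\ 0 \end{pmatrix}$$

Thus:

$$\begin{pmatrix} \Delta w_{10} \\ \Delta w_{11} \\ \Delta w_{12} \end{pmatrix} = \varepsilon\begin{pmatrix} -0.0052 \\ -0.0186 \\ -0.0706 \end{pmatrix}$$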
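In the same way we will find for the adaptation of the weights of neuron 2, where
$df_2/ds_2$ equals 0.197 for input $A_1$ and 0.235 for input $B_2$:

$$\begin{pmatrix} \Delta w_{20} \\ \Delta w_{21} \\ \Delta w_{22} \end{pmatrix} = \varepsilon(0.16)(0.224)(1)(0.197)\begin{pmatrix} 1 \\ 0 \\ 10 \end{pmatrix} + \varepsilon(-0.038)(0.249)(1)(0.235)\begin{pmatrix} 1 \\ -10 \\ 0 \end{pmatrix}$$

Thus:

$$\begin{pmatrix} \Delta w_{20} \\ \Delta w_{21} \\ \Delta w_{22} \end{pmatrix} = \varepsilon\begin{pmatrix} 0.0048 \\ 0.0222 \\ 0.0706 \end{pmatrix}$$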


After choosing some value of $\varepsilon$ (e.g. $\varepsilon = 0.5$) we can adapt the set of weights and
repeat the whole procedure. After a great number of learning steps (e.g. 125) the MSE
will become zero and one would expect the classification to be correct.

However, with the single threshold labelling method, a zero value for the MSE
will not guarantee a correct classification. If at the final step of learning $y_3$ is a little
bit smaller than 0.5 for the inputs $A_1$ and $A_2$, and a little bit greater than 0.5 for the
inputs $B_1$ and $B_2$, then all inputs are wrongly classified, whereas the MSE is almost
zero. For a correct classification one has to use the one-zero labelling method or
the method of double threshold labelling by setting the target for the set $D_A$ equal
to, for example, 0.95 and for the set $D_B$ equal to 0.05.
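The first learning step of Example 3.12 can be recomputed from the initial weights. The sketch below assumes the single threshold rule used in this example (error $0.5 - y_3$ for misclassified inputs, no contribution otherwise) and works with unrounded values, so the last digit may differ from figures obtained with rounded table entries. The function name is hypothetical.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

W1 = [0.0, 0.1, 0.1]      # w10, w11, w12
W2 = [0.0, -0.05, -0.1]   # w20, w21, w22
W3 = [-0.2, -1.0, 1.0]    # w30, w31, w32

D_A = [(0, 10), (0, -10)]   # output should exceed 0.5
D_B = [(10, 0), (-10, 0)]   # output should stay below 0.5

def sums_for_one_step():
    """Adaptation sums (to be multiplied by eps) for neurons 1, 2 and 3."""
    g1, g2, g3 = [0.0] * 3, [0.0] * 3, [0.0] * 3
    for points, wants_high in ((D_A, True), (D_B, False)):
        for x in points:
            y1 = sigmoid(W1[0] + W1[1] * x[0] + W1[2] * x[1])
            y2 = sigmoid(W2[0] + W2[1] * x[0] + W2[2] * x[1])
            y3 = sigmoid(W3[0] + W3[1] * y1 + W3[2] * y2)
            wrong = y3 < 0.5 if wants_high else y3 > 0.5
            if not wrong:
                continue                       # correctly classified: no contribution
            d3 = (0.5 - y3) * y3 * (1.0 - y3)  # output error times df3/ds3
            d1 = d3 * W3[1] * y1 * (1.0 - y1)  # back-propagated to neuron 1
            d2 = d3 * W3[2] * y2 * (1.0 - y2)  # back-propagated to neuron 2
            for k, (zk, xk) in enumerate(zip((1.0, y1, y2), (1.0, x[0], x[1]))):
                g3[k] += d3 * zk
                g1[k] += d1 * xk
                g2[k] += d2 * xk
    return g1, g2, g3

g1, g2, g3 = sums_for_one_step()
for name, g in (("neuron 1", g1), ("neuron 2", g2), ("neuron 3", g3)):
    print(name, [round(v, 4) for v in g])
```

Only $A_1$ and $B_2$ contribute, since $A_2$ and $B_1$ are already on the correct side of 0.5 with the initial weights.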


A general two-layer Perceptron may have $n_1$ neurons in the first layer and $n_2$
neurons in the second layer (see Figure 3.47). For the MSE for the finite set $U$ of
input vectors of some given finite learning set $L$ we have:

$$E = \sum_{x_i \in U} \sum_{j=1}^{n_2} \{t_{2j}(x_i) - y_{2j}(x_i)\}^2$$

Figure 3.47 The general two-layer continuous Perceptron

According to Theorem 3.2 the learning implies for the adaptation of the weights
connecting neuron $k$ in the first layer to neuron $j$ in the second layer:

$$\Delta w_{2jk} = \varepsilon \sum_{x_i \in U} \{t_{2j}(x_i) - y_{2j}(x_i)\} \frac{df_{2j}}{ds_{2j}} y_{1k}(x_i)$$
If we represent all the outputs of the $n_1$ neurons in the first layer including the
constant component $y_{10} = 1$ with a vector $\hat{y}_1$, then we can write, with the use of the
extended weight vector $\hat{w}_{2j} = [w_{2j0}, w_{2j1}, \ldots, w_{2jn_1}]^t$, for the adaptation of the weight
vector $\hat{w}_{2j}$ of output neuron $j$:

$$\Delta\hat{w}_{2j} = \varepsilon \sum_{x_i \in U} \{t_{2j}(x_i) - y_{2j}(x_i)\} \frac{df_{2j}}{ds_{2j}} \hat{y}_1(x_i)$$


With the notation $\delta_{2j}(x_i) = t_{2j}(x_i) - y_{2j}(x_i)$ for the error of the output neuron $j$ for
an input $x_i$, we can thus write the following theorem:


Theorem 3.7


The adaptation of the weight vector $\hat{w}_{2j}$ of a neuron $j$ in the second layer of a two-layer
Perceptron, in order to minimize the MSE for a given set of $n_2$ target values $t_{2j}(x_i)$
for a given finite set of input vectors $U$, is:

$$\Delta\hat{w}_{2j} = \varepsilon \sum_{x_i \in U} \delta_{2j}(x_i) \frac{df_{2j}}{ds_{2j}} \hat{y}_1(x_i)$$


For the adaptation of the weight $w_{1km}$ connecting the net input $x_m$ with neuron
$k$ in the first layer, we obtain according to Theorem 3.2:

$$\Delta w_{1km} = \varepsilon \sum_{x_i \in U} \sum_{j=1}^{n_2} \{t_{2j}(x_i) - y_{2j}(x_i)\} \frac{df_{2j}}{ds_{2j}} w_{2jk} \frac{df_{1k}}{ds_{1k}} x_{im}$$

The sum of products over $j$ can be considered as the error (from the output neurons
back-propagated) of neuron $k$ in the first layer. So with:

$$\delta_{1k}(x_i) = \sum_{j=1}^{n_2} \delta_{2j}(x_i) \frac{df_{2j}}{ds_{2j}} w_{2jk}$$

we can write:

$$\Delta w_{1km} = \varepsilon \sum_{x_i \in U} \delta_{1k}(x_i) \frac{df_{1k}}{ds_{1k}} x_{im}$$

If we represent all the inputs of the network, including the constant component
$x_0 = 1$, with a vector $\hat{x}$, then we can write, with the use of the extended weight vector
$\hat{w}_{1k} = [w_{1k0}, w_{1k1}, \ldots, w_{1kn}]^t$, the following theorem for the adaptation of the weight
vector $\hat{w}_{1k}$ of input neuron $k$:


Theorem 3.8


The adaptation of the weight vector $\hat{w}_{1k}$ of a neuron $k$ in the first layer of a two-layer
Perceptron in order to minimize the MSE for a given set of $n_2$ target values $t_{2j}(x_i)$
for a given finite set $U$ of input vectors, is:

$$\Delta\hat{w}_{1k} = \varepsilon \sum_{x_i \in U} \delta_{1k}(x_i) \frac{df_{1k}}{ds_{1k}} \hat{x}_i$$


We define the internal learning rate for input vector $x_i$ for the weights of neuron
$k$ in the first layer as:

$$\gamma_{1k}(x_i) = \delta_{1k}(x_i) \frac{df_{1k}}{ds_{1k}}$$

With this definition we obtain for the adaptation of the weights of neuron $k$ in the
first layer due to input $x_i$:

$$\Delta\hat{w}_{1k}(x_i) = \varepsilon \gamma_{1k}(x_i) \hat{x}_i$$

The sum of the extended input vectors $\hat{x}_i$, weighted by the scalars $\gamma_{1k}(x_i)$, will be
denoted by $\bar{x}$:

$$\bar{x} = \sum_{x_i \in U} \gamma_{1k}(x_i) \hat{x}_i$$

Thus we can write for the adaptation:

$$\Delta\hat{w}_{1k} = \varepsilon \bar{x}$$

We see that after adaptation the weight vector of neuron $k$ in the first layer will turn
in the direction of the weighted sum $\bar{x}$ of input vectors $\hat{x}_i$. This implies that the
separating hyperplane $\hat{w}_{1k}^t \hat{x} = 0$ in the extended input space, as well as the hyperplane
$w^t x = 0$ in the non-extended input space, of neuron $k$ will turn in a direction
'perpendicular' to the weighted sum of input vectors. For the two-dimensional case
see Figure 3.48. [NB: the weighted input $s_{1k}(x_i)$ of neuron $k$ in the first layer is zero
for an input $x_i$ on that separating hyperplane.]











Figure 3.48 The input/weight space of a first-layer neuron with the
adaptation $\Delta\hat{w}$ in the direction of the weighted sum $\bar{x}$ of input
vectors $\hat{x}_i$ [figure labels: $(w_0 = 0.1, w_1 = 0.1, w_2 = 0.1)$ and $\hat{w}^t \hat{x} = 0$]

The distance from the origin to the separating hyperplane is given by:

$$d = \frac{|w_{1k0}|}{|w_{1k}|}$$

with $w_{1k}$ the non-extended weight vector of neuron $k$. Because of the adaptation of $w_{1k}$
and of the threshold, $\Delta w_{1k0} = \varepsilon \sum_{x_i \in U} \gamma_{1k}(x_i)$, there will be
at the same time a translation of the separating hyperplane.
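Both geometric effects, the turn of the weight vector towards the weighted sum of input vectors and the shift of the hyperplane, can be made visible with a few numbers. In the sketch below the vector `x_bar` and the learning rate are hypothetical values chosen only for illustration, and the distance is computed with the standard point-to-hyperplane formula.

```python
import math

def distance_to_origin(w):
    """Distance from the origin to the hyperplane w[0] + w[1]*x1 + w[2]*x2 = 0."""
    return abs(w[0]) / math.hypot(w[1], w[2])

def cos_angle(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b)))

w = [0.1, 0.1, 0.1]          # extended weight vector [w0, w1, w2]
x_bar = [0.2, -0.5, 0.3]     # hypothetical weighted sum of extended input vectors
eps = 0.5                    # hypothetical learning rate

w_new = [wk + eps * xk for wk, xk in zip(w, x_bar)]

print(cos_angle(w, x_bar), cos_angle(w_new, x_bar))        # the angle to x_bar shrinks
print(distance_to_origin(w), distance_to_origin(w_new))    # the hyperplane is translated
```

A larger cosine after the step means a smaller angle between the weight vector and $\bar{x}$, and the changed distance shows the simultaneous translation.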


3.13 Under-fitting and over-fitting of a data set with a 
two-layer continuous Perceptron 

A two-layer continuous Perceptron is frequently used to find a functional relationship
behind a data set $D$ of examples of pairs of arguments and function values:
$D = \{(x_1, t(x_1)), (x_2, t(x_2)), \ldots, (x_m, t(x_m))\}$. However, if we use a two-layer continuous
Perceptron to approximate a given data set, then after learning, each output neuron
$j$ of the neural net will realize some function $g_j(x)$ that is restricted to the class of
functions realizable by that net. The function $g_j(x)$ depends on the transfer functions
of the neurons and the number of neurons in the first layer. So if one is going to
accept the outcome of the neural net, one is a priori assuming that the functional
relationship belongs to the restricted class of functions realizable by that particular
configuration of the neural net.

If we use, for example, a neural net with two first-layer neurons and one output
neuron and all neurons have a sigmoid transfer function (see Figure 3.49), then the
class of functions with one argument value is restricted to types of the form given
by $y_{21}$ in Figures 3.50-3.52.

The weights for the neurons for the function $y_{21}$ in Figure 3.50 are $w_{110} = -4$,
$w_{111} = 1$, $w_{120} = 4$, $w_{121} = 1$, $w_{210} = 0$, $w_{211} = 1$ and $w_{212} = 1$. The weights for the
neurons for the function $y_{21}$ in Figure 3.51 are $w_{110} = 8$, $w_{111} = -2$, $w_{120} = -8$,
$w_{121} = -2$, $w_{210} = 0$, $w_{211} = 1$ and $w_{212} = 1$. The weights for the neurons for the
function $y_{21}$ in Figure 3.52 are $w_{110} = 8$, $w_{111} = 2$, $w_{120} = -8$, $w_{121} = 2$, $w_{210} = 0$,
$w_{211} = 1$ and $w_{212} = -1$.
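The three functions can be evaluated directly from these weight sets. The sketch below assumes sigmoid transfer functions for all three neurons and reads the threshold of the output neuron as zero in all three cases (an assumption for the third set). The function name `y21` and the constant names are hypothetical.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def y21(x, first_layer, second_layer):
    """first_layer: [(w110, w111), (w120, w121)]; second_layer: (w210, w211, w212)."""
    hidden = [sigmoid(w0 + w1 * x) for w0, w1 in first_layer]
    s = second_layer[0] + sum(w * y for w, y in zip(second_layer[1:], hidden))
    return sigmoid(s)

FIG_3_50 = ([(-4, 1), (4, 1)], (0, 1, 1))
FIG_3_51 = ([(8, -2), (-8, -2)], (0, 1, 1))
FIG_3_52 = ([(8, 2), (-8, 2)], (0, 1, -1))    # output threshold 0 is an assumption

for name, net in (("3.50", FIG_3_50), ("3.51", FIG_3_51), ("3.52", FIG_3_52)):
    print(name, [round(y21(x, *net), 3) for x in (-10, -4, 0, 4, 10)])
```

Sampling more values of $x$ traces the curves: the first two are built from two shifted sigmoids of the same sign, while the third is a difference of two shifted sigmoids and therefore returns to 0.5 far from the origin.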


Although we must conclude that the number of functions realizable by a two-layer 
net with a given number of first-layer neurons is restricted, we will, however, see in 
the next section that any continuous function can be approximated within a finite 
domain, up to any given accuracy, if we use a sufficient number of first-layer neurons. 


Example 3.13 


Assume we have a data set $D = \{-1, -0.5, 0.0, 0.5, 1\}$ with targets $t(-1) = 0$,
$t(-0.5) = 1$, $t(0) = 0.5$, $t(0.5) = 0$ and $t(1) = 1$ generated by the unknown function
$t(x) = 0.5 - 1.5x + 2x^3$. We want to find, with a two-layer Perceptron with one output
neuron and two first-layer neurons, the unknown function behind the data set $D$.
The examples of the data set are repeatedly presented to the learning neural net until
a minimum is reached for the MSE. Using the learning rules discussed before, we
will find, depending on the initial random distribution of weights, three different
functions $g_w(x)$ realized by the neural net. The functions are presented in Figures
3.53-3.55 by the dashed lines. The solid line presents the function $t(x) = 0.5 - 1.5x + 2x^3$.

We see that we cannot find, even with more data points, the unknown function
with the selected configuration of the neural net.

Figure 3.49 Reversed, translated and standard sigmoid functions [figure labels: examples of single-neuron output; x, input]


We found that if the number of first-layer neurons is too small, we cannot realize 
the approximate fitting of data points. On the other hand, we can choose too many 
neurons in the first layer. With a great number of neurons in the first layer the 
function realized by the neural net will go exactly through the data points (i.e. the
MSE will be zero) but will fluctuate wildly in the intervals between the data points.
We say the data points are over-fitted. 


Example 3.14 


We take the same data set as in Example 3.13: $D = \{-1, -0.5, 0.0, 0.5, 1\}$ with targets
$t(-1) = 0$, $t(-0.5) = 1$, $t(0) = 0.5$, $t(0.5) = 0$ and $t(1) = 1$ generated by the unknown
function $t(x) = 0.5 - 1.5x + 2x^3$. Again we want to find with a two-layer Perceptron
the unknown function behind the data set $D$. We take the number of first-layer
neurons equal to the number of elements in the data set, i.e. five.

Figure 3.50 Example of the output $y_{21}$ of a continuous Perceptron with two
neurons in the first layer and one neuron in the second layer
for a one-dimensional input

We will show that there exists a selection of weights such that the realized function
fits almost exactly the data points. The MSE will then be zero.

For the fifth neuron in the first layer we select $w_{50} = -75$ and $w_{51} = 100$. This
implies that the output $y_5$ of that neuron is almost equal to 1 for the fifth data point,
1, and will be zero for the other data points. For the fourth neuron we select $w_{40} = -25$
and $w_{41} = 100$. This implies that the output $y_4$ of that neuron is almost equal to 1
for the fourth and fifth data points, 0.5 and 1, and will be zero for the other data
points. In the same way we select $w_{30} = 25$, $w_{31} = 100$, $w_{20} = 75$, $w_{21} = 100$, $w_{10} = 125$
and $w_{11} = 100$. In Table 3.4 the outputs for the first-layer neurons and the target
value of the neuron in the second layer are given.

The outputs of the first layer are the inputs for the neuron in the second layer. If
we select the weights for the output neuron in the second layer as $w_{20} = 0$, $w_{21} = -100$,
$w_{22} = 200$, $w_{23} = -100$, $w_{24} = -100$ and $w_{25} = 200$, then the function (see Figure 3.56)
produced by the neural network will go through the data points and the MSE will
be zero but the function will fluctuate wildly in the interval between the data points.
We can use the same method even if the number of data points is very large.
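The weight selection of this example can be tried out directly. The sketch below builds the network with five first-layer neurons from the weights listed above, assuming sigmoid transfer functions throughout, and compares its output with the generating function both at the data points and halfway between them. The names `net`, `t`, `FIRST_LAYER`, `OUTPUT_BIAS` and `OUTPUT_WEIGHTS` are hypothetical.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

# (threshold, input weight) of first-layer neurons 1..5
FIRST_LAYER = [(125, 100), (75, 100), (25, 100), (-25, 100), (-75, 100)]
OUTPUT_BIAS = 0
OUTPUT_WEIGHTS = [-100, 200, -100, -100, 200]   # weights from neurons 1..5 to the output

def net(x):
    hidden = [sigmoid(w0 + w1 * x) for w0, w1 in FIRST_LAYER]
    return sigmoid(OUTPUT_BIAS + sum(w * y for w, y in zip(OUTPUT_WEIGHTS, hidden)))

def t(x):
    """The generating function of the data set."""
    return 0.5 - 1.5 * x + 2.0 * x ** 3

for x in (-1.0, -0.5, 0.0, 0.5, 1.0):          # the data points
    print(x, round(net(x), 6), t(x))
for x in (-0.75, -0.25, 0.25, 0.75):           # halfway between the data points
    print(x, round(net(x), 6), round(t(x), 4))
```

At the five data points the network output coincides with the targets to within rounding, while between them it jumps between values near 0, 0.5 and 1 and no longer follows the cubic.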
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Output of 2,1 network 


Example of the output yz, of a continuous Perceptron with two 
neurons in the first layer and one neuron in the second layer. 


Figure 3.52 
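The construction of Example 3.14 can be checked numerically. The following is a minimal sketch, assuming the logistic sigmoid $1/(1 + e^{-s})$ for all six neurons; the helper names (`sigmoid`, `network`, `target`) are illustrative and do not come from the text.

```python
import math

def sigmoid(s):
    """Logistic transfer function, written to avoid overflow for large |s|."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)

# (bias weight, input weight) of first-layer neurons 1..5, as chosen in Example 3.14
FIRST_LAYER = [(125.0, 100.0), (75.0, 100.0), (25.0, 100.0),
               (-25.0, 100.0), (-75.0, 100.0)]
# bias weight and input weights of the single second-layer neuron
OUTPUT_BIAS = 0.0
OUTPUT_WEIGHTS = [-100.0, 200.0, -100.0, -100.0, 200.0]

def network(x):
    """Output of the two-layer Perceptron for a scalar input x."""
    hidden = [sigmoid(w0 + w1 * x) for w0, w1 in FIRST_LAYER]
    s = OUTPUT_BIAS + sum(w * y for w, y in zip(OUTPUT_WEIGHTS, hidden))
    return sigmoid(s)

def target(x):
    """The function that generated the data set."""
    return 0.5 - 1.5 * x + 2.0 * x ** 3

DATA = [-1.0, -0.5, 0.0, 0.5, 1.0]
mse = sum((network(x) - target(x)) ** 2 for x in DATA) / len(DATA)
print("MSE on the learning set:", mse)

# Between the data points the realized function is close to a staircase
# and departs clearly from the generating polynomial.
for x in (-0.7, 0.25):
    print(x, "network:", round(network(x), 3), "target:", round(target(x), 3))
```

The learning-set error is negligible, while the values printed for the two intermediate inputs show how far the realized function can be from the generating function between the samples.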
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From the discussion above we can conclude the following: 


Practical statement 3.10

If we want to infer from a given data set the unknown functional relationship and
select the number of first-layer neurons too low, then the unknown function will be
under-fitted (the realized function will not go through all data points); if we select
the number of neurons too high, then the unknown function will be over-fitted (the
realized function will go through the data points but will fluctuate wildly in between).

Figure 3.53 Function learned (dashed line) with a two-layer Perceptron.
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Figure 3.56 Function that can be realized by a continuous Perceptron with 


Figure 3.54 Function learned (dashed line) with a two-layer Perceptron. 
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Figure 3.55 Function learned (dashed line) with a two-layer Perceptron. 
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3.14 The class of functions realizable with a two-layer 
Perceptron 


In the previous section we found that the function realized with a two-layer continuous
Perceptron will not go through the data points of the learning set if the number of
first-layer neurons is too small; if, however, the number of first-layer neurons is too
large, the function will go through the data points but will fluctuate wildly in between.


In this section we will show that a two-layer continuous Perceptron with a sufficient 
number of neurons in the first layer can approximate any continuous function 
arbitrarily well. This implies that there is no theoretical argument to use a Perceptron 
with more than two layers to identify some unknown function. We will, however, see 
in Section 3.15 that in some cases it might be more profitable to use a three-layer 
Perceptron because the total number of neurons may be less. 

The basic idea behind the statement above is that we can approximate any 
continuous function with a Taylor series, and on the other hand we can also
approximate the function realized by a two-layer Perceptron with a Taylor series. 
We will see that we can modify the coefficients of the Taylor series of the Perceptron 
by selecting appropriate weights in such a way that the coefficients of the Taylor 
series of the function realized by the Perceptron will be equal to the coefficients of 
the Taylor series of the given continuous function. We consider first the 
one-dimensional case. 

Let $f(x)$ be a continuous function from R to R, which is continuously differentiable
round a certain point $x_0$. There then exists a unique series of reals $a_i$ ($i = 0, 1, 2, \ldots$)
for which:

$$f(x) = \sum_{i=0}^{n} a_i (x - x_0)^i + R_n$$

with $a_i = f^{(i)}(x_0)/i!$, with $f^{(i)}(x_0)$ the $i$th derivative of $f$ in $x_0$, and $R_n$ the
remainder term with $\lim_{n \to \infty} R_n = 0$.

According to Taylor's theorem, for every function $f$ and for every domain $[x_0, x_1]$
and for every $\varepsilon$ there exists an $n$ such that:

$$\max_{x \in [x_0, x_1]} \left| f(x) - \sum_{i=0}^{n} a_i (x - x_0)^i \right| < \varepsilon$$

In other words we can approximate any function $f$ in any domain $[x_0, x_1]$ arbitrarily
well with the series expansion.

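The same holds for the $p$-dimensional case with $f$ a function from $R^p \to R$. The
Taylor series in that case will be:

$$f(\mathbf{x}) = \sum_{i=0}^{n} a_i\,[(\mathbf{x} - \mathbf{x}_0)\nabla]^i f(\mathbf{x}_0) + R_n$$

with $a_i = 1/i!$ and:

$$[(\mathbf{x} - \mathbf{x}_0)\nabla]^0 f(\mathbf{x}_0) = f(\mathbf{x}_0)$$

$$[(\mathbf{x} - \mathbf{x}_0)\nabla]^1 f(\mathbf{x}_0) = (x_1 - x_{01})\frac{\partial f}{\partial x_1}(\mathbf{x}_0) + \cdots + (x_p - x_{0p})\frac{\partial f}{\partial x_p}(\mathbf{x}_0)$$

$$[(\mathbf{x} - \mathbf{x}_0)\nabla]^k f(\mathbf{x}_0) = [(\mathbf{x} - \mathbf{x}_0)\nabla]\,[(\mathbf{x} - \mathbf{x}_0)\nabla]^{k-1} f(\mathbf{x}_0)$$

and $\lim_{n \to \infty} R_n = 0$.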

We now return to the one-dimensional case and consider the function $g_w(x)$ from
R to R, realized by a continuous Perceptron with $m$ neurons in the first layer. Each
neuron $i$ in the first layer has a sigmoid transfer function:

$$f_{1i}(s) = \left[1 + \exp\bigl(-(w_{1i0} + w_{1i1}x)\bigr)\right]^{-1}$$

with $w_{1i0}$ the weight of the $i$th first-layer neuron connected to the constant input
$x_0 = 1$, and $w_{1i1}$ the weight of the $i$th first-layer neuron connected to input $x_1 = x$ (see
Figure 3.57).

We take one output neuron with a linear transfer function:

$$f_2(s) = s$$

with $s = w_{20} + \sum_i w_{2i} f_{1i}$ the weighted input of the output neuron. (We could take a
non-linear transfer function as well, but it would only complicate our discussion.)
The function $g_w(x)$ realized by the neural net will have the following form:

$$g_w(x) = w_{20} + \sum_{i=1}^{m} w_{2i}\, f_{1i}(w_{1i0} + w_{1i1}x)$$
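The Taylor expansion of this function round a point $x_0$ will be:

$$g_w(x) = \sum_{i=0}^{n} a'_i (x - x_0)^i + R_n$$

with

$$a'_0 = g_w(x_0), \qquad a'_1 = \frac{1}{1!}\,\frac{dg_w}{dx}(x_0), \qquad a'_2 = \frac{1}{2!}\,\frac{d^2 g_w}{dx^2}(x_0), \qquad \text{etc.}$$

With $g_w(x_0) = w_{20} + \sum_i w_{2i} f_{1i}[s_i(x_0)]$, in short $g_w = w_{20} + \sum_i w_{2i} f_{1i}$, with
$s_i = w_{1i0} + w_{1i1}x$ and $df/ds = f(1 - f)$, we obtain:

$$a'_0 = w_{20} + \sum_{i=1}^{m} w_{2i} f_{1i}$$

$$a'_1 = \sum_{i=1}^{m} w_{2i}\,\frac{df_{1i}}{ds_i}\,\frac{ds_i}{dx} = \sum_{i=1}^{m} w_{2i}\, w_{1i1}\,(f_{1i} - f_{1i}^2)$$

$$a'_2 = \frac{1}{2}\sum_{i=1}^{m} w_{2i}\,\frac{d^2 f_{1i}}{ds_i^2}\left(\frac{ds_i}{dx}\right)^2 = \frac{1}{2}\sum_{i=1}^{m} w_{2i}\, w_{1i1}^2\,(f_{1i} - 3f_{1i}^2 + 2f_{1i}^3)$$

etc.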
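The polynomial in $f$ that appears in the second coefficient follows from differentiating the sigmoid twice. As a short worked step, assuming the logistic form $f(s) = 1/(1 + e^{-s})$ that is consistent with $df/ds = f(1 - f)$:

```latex
\begin{align*}
\frac{df}{ds} &= f(1 - f) = f - f^2, \\
\frac{d^2 f}{ds^2} &= (1 - 2f)\,\frac{df}{ds} = f(1 - f)(1 - 2f) = f - 3f^2 + 2f^3, \\
\frac{d^k}{dx^k}\, f(w_{1i0} + w_{1i1}x) &= w_{1i1}^{\,k}\, \frac{d^k f}{ds^k}
  \qquad \text{since } \frac{ds}{dx} = w_{1i1} \text{ is constant.}
\end{align*}
```

Each further Taylor coefficient is obtained in the same way: one more derivative of $f$ with respect to $s$, one more factor $w_{1i1}$, and the factor $1/k!$.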


Given for some $n$ the Taylor approximation of a given function:

$$f(x) = \sum_{i=0}^{n} a_i (x - x_0)^i + R_n$$

we can select the weights of the Perceptron $w_{1i1}$, $w_{1i0}$ and $w_{2i}$ for $i = 1, 2, \ldots, m$ such
that $a_j = a'_j$ for $j = 0, 1, 2, \ldots, n$.

If the number $m$ of first-layer neurons is equal to the number of $n + 1$ Taylor
coefficients minus one, i.e. $m = n$, then we can take the weights $w_{1i0}$ and $w_{1i1}$ of the
first-layer neurons almost arbitrarily, because we can still select the $m + 1$ weights
$w_{20}, w_{21}, \ldots, w_{2m}$ of the output neuron such that for the $n + 1$ Taylor coefficients
we have $a_j = a'_j$ for $j = 0, 1, 2, \ldots, n$.

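We can put the requirement $a_j = a'_j$ and the expressions for $a'_j$ in a matrix form:

$$\mathbf{a} = A\,\mathbf{w}_2$$

For $n = 2$ and $m = 2$ we obtain:

$$\begin{pmatrix} a_0 \\ a_1 \\ a_2 \end{pmatrix} =
\begin{pmatrix}
1 & f_{11} & f_{12} \\
0 & (f_{11} - f_{11}^2)\,w_{111} & (f_{12} - f_{12}^2)\,w_{121} \\
0 & \tfrac{1}{2}(f_{11} - 3f_{11}^2 + 2f_{11}^3)\,w_{111}^2 & \tfrac{1}{2}(f_{12} - 3f_{12}^2 + 2f_{12}^3)\,w_{121}^2
\end{pmatrix}
\begin{pmatrix} w_{20} \\ w_{21} \\ w_{22} \end{pmatrix}$$

The weights in the connections from the first layer to the output neuron can be
found from:

$$\mathbf{w}_2 = A^{-1}\mathbf{a}$$

The matrix $A$ can always be made non-singular by choosing suitable values of the
weights $w_{1i0}$ and $w_{1i1}$ for $i = 1, 2, \ldots, m$. For a certain $x_0$ and $w_{1i1}$, the value of $f_{1i}$
occurring in matrix $A$ can be assigned any value between 0 and 1 by choosing $w_{1i0}$.
Thus we can always make $\det A \neq 0$.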

The conclusion is that we can approximate any continuous function $f: R \to R$ in
any domain $[x_0, x_1]$ arbitrarily well by a two-layer Perceptron with $n$ input neurons
and one linear output neuron if the function $f$ is approximated by the first $n + 1$
Taylor coefficients. The same holds for the $p$-dimensional case. The proof is similar
and is based on the introduction of the $n(p - 1)$ additional weights $w_{1ip}$, which we can
freely select. Note that for a close approximation the number of required input neurons
can become quite large. For a full proof see Tromp (1993). Thus we have the following
theorem:


Theorem 3.9


A two-layer continuous Perceptron with sigmoid transfer function for the neurons
in the first layer and one linear neuron in the second layer can approximate any
continuous function $f: R^n \to R$ in any domain with any given accuracy.


Although it is important to know that a two-layer Perceptron can approximate 
any continuous function and that we can calculate the required weights with the 
method described above, the weights obtained after learning with the descending 
gradient method from samples of the function will not be the same as the weights 
found with the method described above. With the Taylor series expansion the realized 
function will approximate the (known!) function very closely in the neighbourhood 
of the point $x_0$ used in the expansion, whereas the realized function obtained after
learning will approximate the (unknown!) function over the entire domain interval 
of applied samples such that the MSE becomes as small as possible. 


Example 3.15 


Let the function we want to approximate be $f(x) = 2x - x^2$. With the method described
above we want to calculate the weights of a two-layer Perceptron. We approximate
the function with the first three Taylor coefficients, so we need a Perceptron with
two neurons in the first layer. We choose to approximate $f(x)$ around $x_0 = 1$. The
first three Taylor coefficients are $a_0 = 1$, $a_1 = 0$ and $a_2 = -1$.

We select the weights of the two first-layer neurons arbitrarily (we have to check
that the matrix $A$ is not singular):


$w_{110} = 0.3$
$w_{111} = 0.1$
$w_{120} = 0.4$
$w_{121} = 0.2$

For $x_0 = 1$ the values of the outputs of the first-layer neurons become:
$f_{11} = 0.690$
$f_{12} = 0.731$

The matrix:

$$A = \begin{pmatrix}
1 & f_{11} & f_{12} \\
0 & (f_{11} - f_{11}^2)\,w_{111} & (f_{12} - f_{12}^2)\,w_{121} \\
0 & \tfrac{1}{2}(f_{11} - 3f_{11}^2 + 2f_{11}^3)\,w_{111}^2 & \tfrac{1}{2}(f_{12} - 3f_{12}^2 + 2f_{12}^3)\,w_{121}^2
\end{pmatrix}$$

becomes

$$A = \begin{pmatrix}
1 & 0.690 & 0.731 \\
0 & 0.064 & 0.118 \\
0 & -0.004 & -0.016
\end{pmatrix}$$

For the weights connected with the output neuron we find with $\mathbf{w}_2 = A^{-1}\mathbf{a}$:

$w_{20} = 56.788$
$w_{21} = -190.867$
$w_{22} = 103.830$

The function $g_w(x)$ realized by the Perceptron with the weights given above is
shown in Figure 3.58 together with the function $f(x)$.

The same configuration of the network was used to learn the function with the
back-propagation gradient descent learning rule. The inputs were randomly chosen
with a uniform distribution between −1 and 3. The initial weights were chosen at
random between −0.1 and 0.1. After 10000 epochs (4.56 sec) the function had the
form as shown in Figure 3.59.

Figure 3.59 The function realized (upper curve) with a two-layer continuous
Perceptron with two first-layer neurons and one linear output
neuron learned with back-propagation with sampled data from
the function $y = 2x - x^2$


3.15 The three-layer continuous Perceptron 


Although a two-layer Perceptron can approximate any continuous function arbitrarily 
well, it may be profitable sometimes to use a three-layer Perceptron because the 
required number of neurons may be less, especially when the function to be 
approximated is expected to contain discontinuities. 








Figure 3.58 The function realized (upper curve) with a two-layer continuous
Perceptron with two first-layer neurons and one linear output
neuron and calculated weights with the Taylor approximation
of $y = 2x - x^2$ (x-axis from −1 to 3)

We will show, for the one-dimensional case, that if a continuous or discontinuous
function can be approximated by a piecewise linear function, it can be approximated
by a three-layer Perceptron.

Assume we have a function $g(x)$ from R to R with function values between zero
and one. We will see that this latter restriction will be eliminated later. We divide the
domain into intervals $[x_k, x_{k+1}]$ with $k = 0, 1, \ldots, N$, such that
$[g(x_{k+1}) - g(x_k)]/(x_{k+1} - x_k)$ is finite, and in every interval we replace $g(x)$ by a
linear function that goes through $g(x_k)$ and $g(x_{k+1})$. In this way we obtain a piecewise
linear function $g'(x)$ that approximates $g(x)$ (see Figure 3.60). The length of the
interval $x_{k+1} - x_k$ will be denoted by $\Delta_k$. The approximation can be improved by making
the intervals $\Delta_k$ smaller. We will denote the approximating function in the domain
$[x_k, x_{k+1}]$ by $g'_k(x)$, and so we can write:

$$g'(x) = \sum_{k=1}^{N} g'_k(x)$$

Figure 3.60 Piecewise linear approximation of some function $g(x)$


We first show that the functions $g'_k(x)$ with $k = 1, 2, \ldots, N$ can be realized by a
two-layer net with four neurons in the first layer and one neuron in the second layer.
We will see that the sum $\sum$ can be realized by one linear neuron in the third layer.

To realize a function $g'_k(x)$ (see Figure 3.61) we use the network configuration of
Figure 3.62. All neurons have a sigmoid transfer function $f$. The first neuron realizes
a decreasing step function at the point $x_k$ (see Figure 3.63). We select $w_{110}$ and $w_{111}$
such that $-w_{110}/w_{111} = x_k$ and $w_{111}$ is given a large negative value, e.g. $w_{111} = -100$.
Thus the output $y_{11}$ of the first neuron in the first layer will be equal to 1 for $x < x_k$
and will be 0 for $x > x_k$. The output is connected by a large negative weight $w_{21}$ (e.g.
−100) to the neuron in the second layer such that the output neuron in the second
layer will always be 0 for $x < x_k$. The second neuron in the first layer realizes a rising
step function (see Figure 3.63a). We select its weights such that $-w_{120}/w_{121} = x_{k+1}$
and we make $w_{121}$ large and positive, e.g. $w_{121} = 100$. Thus the output of neuron 2
in the first layer will be 0 for $x < x_{k+1}$, and will be 1 for $x > x_{k+1}$. The output of
this second neuron is connected by a large negative weight $w_{22}$ to the neuron in the
second layer such that the output neuron will always be zero for $x > x_{k+1}$.

The third neuron is used to realize the correct slope of $g'_k(x)$ (see Figure 3.63b).
We select $w_{130}$ and $w_{131}$ such that $-w_{130}/w_{131} = (x_{k+1} + x_k)/2 = \bar{x}$.

Figure 3.62 The network configuration to realize $g'_k(x)$


With the fourth neuron we can adjust the desired level of $g'_k(x)$ (see Figure 3.63c).
We select $w_{140}$ and $w_{141}$ such that $-w_{140}/w_{141} < x_k$ and $w_{141}$ has a large value. Thus
neuron 4 realizes a rising step function at $x = -w_{140}/w_{141}$ and hence $y_{14} = 1$ in the
domain $[x_k, x_{k+1}]$.

For the output $y_{21}$ of the neuron in the second layer we obtain:

$$y_{21} = f_{21}(w_{20} + w_{21}y_{11} + w_{22}y_{12} + w_{23}y_{13} + w_{24}y_{14})$$

We select $w_{20} = 0$. Within the domain $[x_k, x_{k+1}]$, $y_{11}$ and $y_{12}$ are constant 0 and $y_{14}$
is constant 1; thus with $y_{13} = f_{13}(w_{130} + w_{131}x)$ we obtain:

$$y_{21} = f_{21}[w_{23}f_{13}(w_{130} + w_{131}x) + w_{24}]$$

Figure 3.63 (a) The decreasing step function realized by the first neuron and
increasing step function realized by the second neuron. (b) The
output of the third neuron realizing the slope of $g'_k(x)$.
(c) The output of the fourth neuron to adjust the level of
$g'_k(x)$


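If $\Delta_k$ is small enough we can approximate $y_{21}$ around $\bar{x} = (x_k + x_{k+1})/2$ by:

$$y_{21}(x) = y_{21}(\bar{x}) + \frac{dy_{21}}{dx}(\bar{x})\,(x - \bar{x})$$

At $\bar{x} = -w_{130}/w_{131}$ we have $f_{13}(w_{130} + w_{131}\bar{x}) = 0.5$, thus:

$$y_{21}(\bar{x}) = f_{21}(0.5w_{23} + w_{24})$$

Because $y_{21}(\bar{x}) = g'_k(\bar{x}) = [g(x_k) + g(x_{k+1})]/2$ we obtain for the values of $w_{23}$ and $w_{24}$:

$$f_{21}^{-1}(g'_k(\bar{x})) = 0.5w_{23} + w_{24}$$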
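For the derivative of the function realized by the neural net at $\bar{x}$, we have:

$$\frac{dy_{21}}{dx} = \frac{df_{21}}{ds_{21}}\,\frac{ds_{21}}{dy_{13}}\,\frac{df_{13}}{ds_{13}}\,\frac{ds_{13}}{dx}$$

With $df_{21}/ds_{21} = f_{21}(1 - f_{21})$, $s_{21} = w_{23}y_{13} + w_{24}$, $df_{13}/ds_{13} = 0.25$ and
$s_{13} = w_{130} + w_{131}x$ we obtain:

$$\frac{dy_{21}}{dx}(\bar{x}) = f_{21}(1 - f_{21})\,w_{23}\,0.25\,w_{131}$$

Because we require that $y_{21}(\bar{x}) = f_{21} = g'_k(\bar{x})$ and

$$\frac{dy_{21}}{dx} = \frac{dg'_k}{dx}$$

we obtain for the values of $w_{23}$ and $w_{131}$ the requirement:

$$\frac{dg'_k}{dx} = 0.25\,g'_k(\bar{x})\,[1 - g'_k(\bar{x})]\,w_{23}w_{131}$$

with $dg'_k/dx = [g(x_{k+1}) - g(x_k)]/\Delta_k$ and $g'_k(\bar{x}) = [g(x_k) + g(x_{k+1})]/2$.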
In conclusion we have for the values of the weights of neuron 3 and for $w_{23}$ the
requirements:

$$-w_{130}/w_{131} = (x_{k+1} + x_k)/2 = \bar{x}$$

$$w_{23}w_{131} = \frac{4\,dg'_k/dx}{g'_k(\bar{x})\,[1 - g'_k(\bar{x})]}$$

For the values of the weights of the neuron in the second layer we have the
requirements:

$$w_{20} = 0$$
$$w_{21} < -10$$
$$w_{22} < -10$$
$$0.5w_{23} + w_{24} = f_{21}^{-1}(g'_k(\bar{x}))$$

Example 3.16 


Let $g'_k(x)$ be as shown in Figure 3.64 with $x_k = 5$, $x_{k+1} = 6$, $g(x_k) = 0.87$ and
$g(x_{k+1}) = 0.73$.

We obtain $\bar{x} = 5.5$ and $g'_k(\bar{x}) = 0.8$. From Figure 3.24 we find $f^{-1}(g'_k(\bar{x})) = 1.5$. For
the derivative $dg'_k/dx = [g(x_{k+1}) - g(x_k)]/\Delta_k$ we obtain $dg'_k/dx = -0.14$.

For the first neuron in the first layer we select $w_{111} = -200$. The requirement
$-w_{110}/w_{111} = x_k$ is satisfied by $w_{110} = 1000$.

For the second neuron in the first layer we select $w_{121} = 200$. The requirement
$-w_{120}/w_{121} = x_{k+1}$ is satisfied by $w_{120} = -1200$.

For the moment we skip the third neuron in the first layer.

For the fourth neuron in the first layer we have the requirements $-w_{140}/w_{141} < x_k$
and $w_{141}$ large and positive. We select $w_{140} = -400$ and $w_{141} = 100$.
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Figure 3.64 The function g(x) of Example 3.16 to be produced by the 
network configuration of Figure 3.62 


For the neuron in the output layer we have the requirement:

$$w_{20} = 0$$
$$w_{21} < -10$$
$$w_{22} < -10$$
$$0.5w_{23} + w_{24} = f_{21}^{-1}(g'_k(\bar{x})) = 1.5$$

We select $w_{20} = 0$, $w_{21} = -100$, $w_{22} = -100$, $w_{23} = 4$ and $w_{24} = -0.5$.
Finally we have for the requirements of the third neuron in the first layer:

$$-w_{130}/w_{131} = (x_{k+1} + x_k)/2 = \bar{x} = 5.5$$

$$w_{23}w_{131} = \frac{4\,dg'_k/dx}{g'_k(\bar{x})\,[1 - g'_k(\bar{x})]} = \frac{4(-0.14)}{0.8(1 - 0.8)} = -3.5$$

Because $w_{23} = 4$ we obtain $w_{131} = -0.875$, and because of the first requirement we
obtain $w_{130} = 4.8125$.


A simulation of the neural net with the weights selected will give the function as
shown in Figure 3.65.
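The weights selected in Example 3.16 can be simulated directly. The sketch below assumes the logistic sigmoid for all five neurons; the names `piece` and `chord` are illustrative only. Note that the example reads $f^{-1}(0.8) = 1.5$ from a figure, so the realized value at the midpoint is $f(1.5)$ rather than exactly 0.8.

```python
import math

def sigmoid(s):
    """Logistic transfer function, written to avoid overflow for large |s|."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)

# (w_1i0, w_1i1) for the four first-layer neurons of Example 3.16
FIRST_LAYER = [
    (1000.0, -200.0),    # neuron 1: decreasing step at x = 5
    (-1200.0, 200.0),    # neuron 2: rising step at x = 6
    (4.8125, -0.875),    # neuron 3: slope, centred at x = 5.5
    (-400.0, 100.0),     # neuron 4: level, rising step at x = 4
]
# w_20, w_21, w_22, w_23, w_24 of the second-layer neuron
SECOND_LAYER = [0.0, -100.0, -100.0, 4.0, -0.5]

def piece(x):
    """Output y_21 of the two-layer net that realizes one linear piece."""
    hidden = [sigmoid(w0 + w1 * x) for w0, w1 in FIRST_LAYER]
    s = SECOND_LAYER[0] + sum(w * y for w, y in zip(SECOND_LAYER[1:], hidden))
    return sigmoid(s)

def chord(x):
    """The linear piece through (5, 0.87) and (6, 0.73)."""
    return 0.87 + (0.73 - 0.87) * (x - 5.0)

for x in (4.5, 5.1, 5.5, 5.9, 6.5):
    inside = 5.0 <= x <= 6.0
    print(x, round(piece(x), 4), round(chord(x), 4) if inside else "outside")
```

Inside the interval the output follows the chord to within a few hundredths; outside it the two step neurons force the output to practically zero.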


The original function $g(x)$ was assumed to be approximated by a piecewise function
$g'(x)$, which was constructed out of a sum of functions $g'_k(x)$:

$$g'(x) = \sum_{k=1}^{N} g'_k(x)$$

The summation can be realized by one linear neuron in a third layer:

$$y(x) = w_{30} + \sum_{k=1}^{N} w_{3k}\,g'_k(x)$$

If the minimum value of the original function $g(x)$ to be approximated is $g_{min}$ and its
maximum value is $g_{max}$, we select $w_{30} = g_{min}$ and for all $k$ we select $w_{3k} = g_{max} - g_{min}$.
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Figure 3.65 The function produced by the network of Figure 3.62 with the 
weights calculated in Example 3.16 


We leave the proof for the case of an n-dimensional input to the reader. Our
conclusion is the following theorem:


Theorem 3.10


Every function that can be approximated arbitrarily well by a piecewise linear function
can be realized by a three-layer continuous Perceptron with one linear neuron in the
output layer.


3.16 Application of a two-layer continuous Perceptron to 
function identification 


We found in Section 3.13 that if we want to identify, with a continuous two-layer 
Perceptron, some function from samples of that function, the number of first-layer 
neurons is critical. If we select the number to be too small, the function will be 
under-fitted, i.e. the function realized after learning will not go through the samples 
of the data set. If we select the number of first-layer neurons to be too large, the
function will be over-fitted, i.e. the realized function will go through the samples of 
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the data set but will fluctuate wildly in between. The last phenomenon will be observed 
when we determine the MSE for a test set much larger than the learning set. 

We performed experiments with a learning set of 225 equally spaced samples from
the domain $[-1, 1] \times [-1, 1]$ of the non-linear two-dimensional function (see
Figure 3.66):
glxi, X2)=0.125 +0.125x, +0.375x, x, +0.125x2 


In the first experiment we used a continuous Perceptron with two neurons in the 
first layer and one neuron in the second layer. All neurons had a sigmoid transfer 
function. After training we determined the MSE for 20000 points uniformly chosen 
from the domain. We found an MSE of 0.01411.

In a second experiment we used four neurons in the first layer. The MSE for the
same test set turned out to be considerably smaller: MSE = 0.00552. In the next
experiment we used eight neurons in the first layer. After learning we found
MSE = 0.00611. Apparently the function was over-fitted. In a final experiment we
used sixteen neurons in the first layer. The MSE turned out to be 0.00756.


From these experiments we conclude that with respect to the MSE, under-fitting 
is much worse than over-fitting. 


3.17 Application of a two-layer Perceptron to the mushroom 
classification problem 


In Section 3.1 we promised to show that a continuous Perceptron can learn to classify 
two classes of mushrooms from samples. 
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Figure 3.66 The function g(x,.*,)=0.125+40.125x, +0.375x,%2+0.125x3 
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Figure 3.67 Data of the two-dimensional mushroom classification problem 


We have a learning set consisting of a collection A of mushrooms that are very 
good medicine for some specific illness and a class B of poisonous mushrooms. The 
mushrooms of the two classes differ slightly in the length $x_1$ and in the thickness $x_2$
of the stem (see Figure 3.67).

The fifteen pairs $(x_1, x_2)$ of class A are: (7.9, 7.2), (4.0, 6.3), (6.9, 8.3), (6.4, 7.3), (6.9, 7.2),
(0.9, 9.3), (2.6, 6.9), (2.6, 7.9), (4.9, 9.8), (4.9, 6.9), (4.9, 7.8), (6.9, 9.2), (4.0, 7.3), (4.9, 5.5)
and (8.8, 9.3). The target value for elements of class A during learning is t=0. The 
classification criterion for the output is y<0.5. 

The nineteen points of class B are: (5.9, 2.3), (6.5, 5.8), (8.8, 8.3), (4.0, 5.4), (5.9, 6.3), 
(4.0, 2.4), (1.9, 7.3), (4.9, 3.9), (6.9, 3.5), (7.9, 5.8), (0.9, 8.6), (0.9, 7.3), (0.9, 7.9), (5.9, 4.4), 
(4.0, 3.4), (8.8, 6.4), (3.0, 3.9), (7.9, 4.8) and (2.4, 5.9). The target value for the elements 
of class B is t= 1. The classification criterion for the output of the neural net is y>0.5. 

In Section 3.1 we constructed a solution (not the best one) for the classification 
using the separating lines $L_1$ and $L_2$ of the two neurons in the first layer of a continuous
Perceptron. All points to the right of $L_1$ and simultaneously to the left of $L_2$ belong
to class A; the points in the remaining area are assumed to represent elements of 
class B. The wrongly classified elements for the constructed solution are represented 
by black spots in Figure 3.67. 

We performed several experiments with a continuous Perceptron with two neurons 
in the first layer and one neuron in the second layer (see Figure 3.68). All neurons 
had a sigmoid transfer function. During training we used the hyperplane boundary 
classification method with one-zero labelling.
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problem 





Figure 3.69 The first solution found by the neural network 


After learning, we found several different solutions depending on the initial 
distribution of weights. In four solutions, thirty-one of the thirty-four samples of the 
learning set were correctly classified.

For the first solution the separating lines $L_1$ and $L_2$ realized by the two neurons
in the first layer are shown in Figure 3.69. The wrongly classified elements of the
learning set are represented by black spots. The weights of the neurons for this
classification were:


$w_{110} = 5.196$     $w_{111} = -4.791$    $w_{112} = 0.902$
$w_{120} = 14.086$    $w_{121} = 0.444$     $w_{122} = -2.459$
$w_{20} = -5.297$     $w_{21} = 6.499$      $w_{22} = 8.672$

Thus the separating lines for the neurons in the first layer for this classification are:


$L_1$: $x_2 = 5.312x_1 - 5.761$
$L_2$: $x_2 = 0.181x_1 + 5.728$


This means that data points in the area $B_1$ in Figure 3.69 will give an output for the
first neuron in the first layer: $y_{11} > 0.5$, and for the second neuron in the first layer:
$y_{12} < 0.5$.

The weighted input of the output neuron is $s = -5.297 + 6.499y_{11} + 8.672y_{12}$. For
the elements of the learning set in the area $B_1$ the output $y_2$ of the output neuron
will be >0.5 as required. For the area $B_2$ of Figure 3.69 $y_{11} < 0.5$ and $y_{12} > 0.5$. For
the samples of the learning set in area $B_2$ this will give an output of the neuron in
the second layer: $y_2 > 0.5$ as required. In area $A_1$ of Figure 3.69 we have $y_{11} < 0.5$
and $y_{12} < 0.5$. For the elements of the data set in area $A_1$ this will give: $y_2 < 0.5$ as
required for the samples of class A.
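The first solution can be replayed on the learning set. The following sketch assumes the logistic sigmoid for all three neurons and uses the weights and the thirty-four data points as listed in this section; the function and variable names are illustrative.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

CLASS_A = [(7.9, 7.2), (4.0, 6.3), (6.9, 8.3), (6.4, 7.3), (6.9, 7.2),
           (0.9, 9.3), (2.6, 6.9), (2.6, 7.9), (4.9, 9.8), (4.9, 6.9),
           (4.9, 7.8), (6.9, 9.2), (4.0, 7.3), (4.9, 5.5), (8.8, 9.3)]
CLASS_B = [(5.9, 2.3), (6.5, 5.8), (8.8, 8.3), (4.0, 5.4), (5.9, 6.3),
           (4.0, 2.4), (1.9, 7.3), (4.9, 3.9), (6.9, 3.5), (7.9, 5.8),
           (0.9, 8.6), (0.9, 7.3), (0.9, 7.9), (5.9, 4.4), (4.0, 3.4),
           (8.8, 6.4), (3.0, 3.9), (7.9, 4.8), (2.4, 5.9)]

# (bias, weight for x1, weight for x2) of the two first-layer neurons
FIRST_LAYER = [(5.196, -4.791, 0.902), (14.086, 0.444, -2.459)]
# (bias, weight for y11, weight for y12) of the output neuron
OUTPUT = (-5.297, 6.499, 8.672)

def network(x1, x2):
    y11, y12 = (sigmoid(b + w1 * x1 + w2 * x2) for b, w1, w2 in FIRST_LAYER)
    return sigmoid(OUTPUT[0] + OUTPUT[1] * y11 + OUTPUT[2] * y12)

# Class A has target 0 (criterion y < 0.5), class B has target 1 (y > 0.5).
wrong = [p for p in CLASS_A if network(*p) >= 0.5]
wrong += [p for p in CLASS_B if network(*p) <= 0.5]
correct = len(CLASS_A) + len(CLASS_B) - len(wrong)
print(correct, "of", len(CLASS_A) + len(CLASS_B), "correct; misclassified:", wrong)
```

The count of correctly classified samples agrees with the thirty-one out of thirty-four reported for this solution.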

The separating lines of the two first-layer neurons for the other three solutions, 
with thirty-one out of thirty-four points correctly classified, are given in Figures 3.70- 
3.72. Note that the solution of Figure 3.72 corresponds with our constructed solution 
of Section 3.1. 

Another solution, with only twenty-five out of thirty-four correctly classified points,
found by the learning process gives separating lines of the first-layer neurons defined
by $x_2 = 0.806x_1 - 12.116$ and $x_2 = 0.460x_1 + 2.418$.

The optimal solution, with thirty-two out of thirty-four points correctly classified,
represented by the separating lines in Figure 3.73 was not found by learning in our
experiments.

We learn from this experiment that a classification problem can be solved in
different ways and the neural net will give different solutions if we use a sufficient
number of different distributions of the initial weights in the experiments. In general,
the different solutions are difficult to predict from inspection of the data set. We also
observe that we do not always obtain the optimal solution of the classification problem.

Figure 3.70 The second solution found by the neural network

Figure 3.71 The third solution found by the neural network

Figure 3.73 The optimal solution not found by the neural network


3.18 Application of a two-layer Perceptron to the detection of the
frequency of a sine wave


Assume we are given a sine wave with a given constant amplitude $A$ and arbitrary
phase $\varphi$: $x(t) = A\cos(2\pi ft + \varphi)$ and we want to detect with a multi-layer Perceptron
whether the frequency $f$ is equal to some frequency $f_0$ or not. Thus we want a neural
net with one output neuron with, for instance, an output $y > 0.5$ if $f = f_0$ and $y < 0.5$
if $f \neq f_0$. First we have to determine how we obtain observations from the given
sine wave.

Figure 3.74 Sampling the sine wave with a window

The neural net requires real-valued, finite-dimensional vectors as an input. For
this purpose we can sample the given sine wave at equidistant intervals $T_a$ during a
finite interval $w$. We will call the finite observation interval the window. Let there be
$n$ samples in the window interval $w$; then the components of the observation vector
will be:

$$x_i = A\cos(2\pi f T_a i + \varphi) \quad \text{with } i = 0, 1, \ldots, n$$

To obtain different observations of the sine wave, the window will be shifted over the
sine wave to any position (see Figure 3.74).

We have to select the length of the window and the sample interval $T_a$. The length
of the window must be such that if the frequency f of the sine wave differs from fo, 
then the observation vectors obtained by sampling both sine waves must also be 
different. 

Now consider the sine waves of Figure 3.75, one with frequency $f_0$ and one with
a frequency slightly smaller than $f_0$. If the window is smaller than the period $T_0$
of the sine wave with frequency $f_0$, then there are positions of the window such that
the sequence of samples in the window will be almost the same for both sine waves,
whatever the sample distance $T_a$ might be. The observation vectors obtained by
sampling both sine waves in the window will, however, differ more for all positions
of the window, the longer the window $w$ is. This implies that if we want
to detect the frequency $f_0$ with a high resolution we have to make the window length
larger than the period $T_0$. In our experiment we selected $w = (5/4)T_0$.

Figure 3.75 The influence of small and large windows on detecting small
frequency differences

If we sample a sine wave with frequency $f_0$ with sample distance $T_a$, then there
are sine waves with the same amplitude but with different frequencies that will give
the same set of observation vectors (see Figure 3.76).

The sample values of a sine wave with frequency $f^*$ are:

$$x_i = A\sin(2\pi f^* T_a i)$$

The samples of a sine wave with frequency $f_0$ are:

$$x_i = A\sin(2\pi f_0 T_a i)$$

If $2\pi f^* T_a i = 2\pi f_0 T_a i + k i\,2\pi$ for all $i = 0, 1, 2, \ldots, n$ and some integer $k$, the sample
values of both sine waves will be the same.

Figure 3.76 By sampling with sample frequency $f_a = 4f_0$ we cannot
distinguish between sine waves with frequency $f_0$ and $5f_0$


Thus the sample values will be the same if:

$$f^* = f_0 + \frac{k}{T_a} = f_0 + kf_a$$

with $f_a = 1/T_a$ the sample frequency.

In Figure 3.76 we have $f_a = 4f_0$ and thus with $k = 1$ we obtain the same samples
for a sine wave with frequency $f^* = 5f_0$.

Thus after learning the response $y(f)$ of the neural network will be periodic:
$y(f) = y(f + kf_a)$. Therefore if the sample frequency $f_a$ is too low, we will not be able
to discriminate between $f_0$ and a frequency slightly different from $f_0$: $f_0 + \Delta f_0 = f_0 + f_a$.
In our experiment we selected $f_a = 12f_0$. With the selected window length $w = (5/4)T_0$
and $T_a = (1/12)T_0$ we obtain observation vectors of fifteen successive samples.
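The ambiguity caused by a low sample frequency is easy to reproduce. A minimal sketch, assuming unit amplitude and $f_0 = 1$ (both arbitrary choices made here), with `samples` as an illustrative helper name:

```python
import math

def samples(freq, sample_freq, count, amplitude=1.0):
    """Sample amplitude * sin(2*pi*freq*t) at t = i / sample_freq, i = 0..count-1."""
    step = 1.0 / sample_freq
    return [amplitude * math.sin(2.0 * math.pi * freq * step * i)
            for i in range(count)]

f0 = 1.0

# Sample frequency 4*f0: the sine waves with frequency f0 and 5*f0 coincide.
low = 4.0 * f0
max_diff_low = max(abs(p - q) for p, q in
                   zip(samples(f0, low, 16), samples(5.0 * f0, low, 16)))

# Sample frequency 12*f0, as selected in the experiment: they no longer coincide.
high = 12.0 * f0
max_diff_high = max(abs(p - q) for p, q in
                    zip(samples(f0, high, 15), samples(5.0 * f0, high, 15)))

# Window of (5/4) periods sampled at 12 samples per period.
window_samples = round((5.0 / 4.0) / f0 * high)

print(max_diff_low, max_diff_high, window_samples)
```

At the higher sample frequency the first frequency that is indistinguishable from $f_0$ moves out to $f_0 + 12f_0$, far from the band of interest around $f_0$.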

The learning set was obtained by sampling sine waves with ten different frequencies 
around fo. Each sine wave was observed with our window at twelve equally spaced 
different sample positions, corresponding to twelve different phases between 0 and
$2\pi$. The result was a learning set of $10 \times 12 = 120$ different fifteen-dimensional
observation vectors. 

In Table 3.5 information is given about the frequencies of the sine waves in the
learning set and about the targets for the observation vectors obtained by sampling
those sine waves. The test set of observation vectors was obtained with the same
sampling window on arbitrary sine waves.

Table 3.5

ω/ω0     0     0.3   0.6   0.9   1.1   1.4   1.7   2     2.5   3
Target   0.1   0.1   0.5   0.9   0.9   0.5   0.1   0.1   0.1   0.1

Figure 3.77 The maximum and minimum output of a two-layer network
with one output neuron and four neurons in the first layer for
sine waves with different frequencies and arbitrary phase shift
(vertical axis: $y_{max}$ and $y_{min}$; horizontal axis: $\omega/\omega_0$)
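The learning set described above can be generated as follows. This is a sketch under stated assumptions: unit amplitude, twelve phases taken as $2\pi j/12$, and the fifth frequency ratio of Table 3.5 read as 1.1; the names `observation` and `learning_set` are illustrative.

```python
import math

# Frequency ratios and targets as in Table 3.5 (fifth ratio assumed to be 1.1).
FREQ_RATIOS = [0.0, 0.3, 0.6, 0.9, 1.1, 1.4, 1.7, 2.0, 2.5, 3.0]
TARGETS = [0.1, 0.1, 0.5, 0.9, 0.9, 0.5, 0.1, 0.1, 0.1, 0.1]

SAMPLES_PER_PERIOD = 12   # sample frequency is 12 * f0
WINDOW_SAMPLES = 15       # window of (5/4) periods of f0
N_PHASES = 12             # window positions per sine wave

def observation(ratio, phase, amplitude=1.0):
    """One observation vector of a sine wave with frequency ratio * f0."""
    return [amplitude * math.cos(2.0 * math.pi * ratio * i / SAMPLES_PER_PERIOD + phase)
            for i in range(WINDOW_SAMPLES)]

learning_set = [
    (observation(ratio, 2.0 * math.pi * j / N_PHASES), target)
    for ratio, target in zip(FREQ_RATIOS, TARGETS)
    for j in range(N_PHASES)
]
print(len(learning_set), "vectors of dimension", len(learning_set[0][0]))
```

Shifting the window along a sine wave of fixed frequency is equivalent to changing its phase, which is why the window positions are modelled here as phase offsets.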

We performed experiments with a neural net with one output neuron and 
respectively two, three and four neurons in the first layer. Only the neural net with 
four neurons in the first layer gave a satisfactory result (see Figure 3.77). For a given
frequency the output of the neural net varied between $y_{min}$ and $y_{max}$, depending on
the phase of the particular sine wave.

For a particular distribution of the initial weights, and with the target 0.9 replaced
by 1 and the target 0.1 replaced by 0, we obtained the very good result given in
Figure 3.78.
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Figure 3.78 The maximum and minimum output of a two-layer network 
with one output neuron and four neurons in the first layer for 
sine waves with different frequencies and arbitrary phase shift. 
The targets during learning were respectively 1 and 0 


3.19 Application of a multi-layer Perceptron to machine 
condition monitoring 


It is important not to wait until a defective mechanical machine breaks down before
repairing it. In an ideal maintenance strategy the machine would be taken out of
service and repaired just moments before major damage occurs. To be able to predict
the time when a machine needs maintenance we have to know its condition at each
moment. One technique for monitoring machine condition could be analysis of the
lubricating oils for the presence of particles that indicate wear. Another technique is
vibration analysis: as the condition of a machine changes, so the vibration
characteristics also change. We can take a set of consecutive samples of the vibration
signal at some time during a certain interval and determine with fast Fourier transform
(FFT) the frequency spectrum of the signal at that particular time interval. By
analyzing the frequency spectrum of the vibration signal for a short time interval at
different moments, we can monitor the condition of the machine or its parts. The
frequency spectrum during a certain time interval can be represented by a vector
with components equal to the coefficients of the spectrum. We will call these vectors
spectral vectors.
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Figure 3.79 Illustration of the bearing system modelled in the text 


We have to divide the set of spectral vectors into classes corresponding to different 
categories of machine condition. This classification problem can be learned by a 
multi-layer Perceptron. 

In our application we had to learn the condition (damaged or not) of a ball-bearing 
in some machine (see Figure 3.79). We had at our disposal the vibration signals of 
four equal ball-bearings. One of the bearings was damaged due to an almost invisible 
small pit in the outer track. For every bearing we had a vibration signal at two 
different revolution speeds (1500 c/s and 3000 c/s) and with three different loads (no 
load, 2.5 kN and 5 kN). For every bearing the vibration signal was obtained with 
sensors in three different positions: horizontal, vertical and axial. Thus the number 
of vibration signals for each bearing was 2 x 3 x 3 = 18. 

The vibration signal with a length of about 1 sec was sampled with a frequency 
of 48 kHz. With a window of 128 samples we move with steps of 128 samples along 
the signal. In this way we can place the window in 256 different positions. With the 
FFT applied to every observation we obtain 256 different spectral vectors of sixty-four 
components for each observation signal. 
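
The windowing step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `fft` and `spectral_vectors` are hypothetical, the sixty-four components are taken to be the magnitudes of the first half of each 128-point spectrum, and a synthetic sine wave stands in for a measured vibration signal.

```python
import cmath
import math


def fft(samples):
    """Recursive radix-2 FFT; len(samples) must be a power of two."""
    n = len(samples)
    if n == 1:
        return [complex(samples[0])]
    even = fft(samples[0::2])
    odd = fft(samples[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def spectral_vectors(signal, window=128, step=128):
    """Slide a window along the signal; return one magnitude spectrum per position.

    Only the first window // 2 coefficients are kept, because the spectrum of a
    real signal is symmetric (assumption: these are the sixty-four components).
    """
    vectors = []
    for start in range(0, len(signal) - window + 1, step):
        spectrum = fft(signal[start:start + window])
        vectors.append([abs(c) / window for c in spectrum[:window // 2]])
    return vectors


# Synthetic stand-in signal: a sine with exactly 8 cycles per 128-sample window.
signal = [math.sin(2 * math.pi * 8 * i / 128) for i in range(128 * 256)]
vectors = spectral_vectors(signal)
```

With 128 x 256 samples the window fits in 256 positions, and each position yields one spectral vector of sixty-four components; for this test signal the energy sits in component 8.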

We used a neural net with four neurons in the input layer each with sixty-four 
inputs and one output neuron with four inputs. Each neuron had a sigmoid transfer 
function. The target value of a spectral vector obtained from a vibration signal of 
the damaged bearing was 0; for the other spectral vectors the target value was 1. 

In one experiment we trained the neural net with spectral vectors obtained from 
vibration signals of bearing no. 1 (not damaged) and bearing no. 2 (damaged) at a 
revolution speed of 1500 c/s, with the three different loads and with the sensor in 
both vertical and horizontal positions. 

After learning, we tested the network with 256 different spectral vectors for each 
vibration signal. A spectral vector was classified as ‘damaged’ if the output of the 
neural net was <0.5, and as ‘not damaged’ if the output was >0.5. A part of the 
results from the experiment is given in Table 3.6. Each row in the table corresponds 
to the observations of one vibration signal. The first column gives the number of the 
bearing and whether (x) or not (-) the corresponding vibration signal was used 
during training. The next eight columns give the state of the bearing and the position 
of the sensor. The last two columns give the classification result of the 256 spectral 
vectors of the corresponding vibration signal. 

Table 3.6 (column headings: Bearing number, train; Load (kN); Rev. speed (r.p.m.); Sensor: Horiz., Vert., Axial; Classification: Damaged, Not damaged) 

We observe that correct classification (by majority voting) occurs, even on those 
vibration signals not used in the training phase, and certainly if the machine conditions 
are the same for training and testing. 

In additional experiments, which included observations of the vibration signal of 
the damaged bearing at different revolution speeds in the training sets, we were able 
to improve the classification results. 
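
The decision rule for one vibration signal (threshold each network output at 0.5, then take the majority over its spectral vectors) can be sketched as follows. The function name `classify_signal` is hypothetical, and the handling of ties and of outputs exactly equal to 0.5 (both counted as 'not damaged') is an assumption, since the text does not specify it.

```python
def classify_signal(outputs, threshold=0.5):
    """Classify one vibration signal from the network outputs of its spectral vectors.

    Target 0 means damaged and target 1 means not damaged, as in the text.
    Returns the two vote counts and the majority decision.
    """
    damaged = sum(1 for y in outputs if y < threshold)
    not_damaged = len(outputs) - damaged
    label = "damaged" if damaged > not_damaged else "not damaged"
    return damaged, not_damaged, label
```

The two counts correspond to the last two columns of a row in the results table; the label is the majority vote.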


3.20 The learning speed of a continuous multi-layer 
Perceptron 


The learning time for a continuous multi-layer Perceptron can be very long because 
the adaptation of the weights can be very small due to the small value of the derivatives 
of the MSE: 

$$\Delta w = -\varepsilon \nabla E$$

The error ‘landscape’ E may be very flat in certain areas and may have steep valleys 
in other regions. In the flat areas the value of the learning rate $\varepsilon$ may be chosen 
large, whereas at steep valleys and in the neighbourhood of a minimum the value of 
$\varepsilon$ must be small. 
How do we have to select $\varepsilon$ in order to proceed rapidly through the error ‘landscape’? 
In Section 3.12 we found for the adaptation of the weights in the neural net: 
$\Delta w = -\varepsilon \nabla E$. For the adaptation of the extended weight vector $\hat{w}_j$ of neuron j in 
layer k we found: 

$$\Delta \hat{w}_j = \varepsilon \sum_{x_i \in U} \delta_j(x_i) \frac{df}{ds_j} \hat{z}(x_i)$$

with $\delta_j(x_i)$ the (back-propagated) error for input $x_i$ assigned to the output of neuron 
j in layer k, and $\hat{z}(x_i)$ the input of that neuron if the input of the neural net is $x_i$. 
We can improve learning speed by adding to the calculated value of $\Delta w(t)$ (with 
$w(t)$ the vector containing all weights in the neural net) at learning step t a vector 
proportional to the calculated value $\Delta w(t-1)$ in the previous step: 

$$\Delta w^*(t) = \Delta w(t) + \alpha \Delta w(t-1) \quad \text{with } \alpha \text{ between 0 and 1}$$

This method is called the momentum method, and $\alpha$ is called the momentum parameter. 
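
A minimal sketch of the momentum method on a toy problem may help. The quadratic error function, the parameter values and the function names are illustrative assumptions, not taken from the text; the sketch also uses the common variant in which the previous total step (including its own momentum term) is carried over.

```python
def gradient(w):
    """Gradient of the illustrative error E(w) = 0.5 * (w0**2 + 10 * w1**2)."""
    return [w[0], 10.0 * w[1]]


def train_with_momentum(w, epsilon=0.05, alpha=0.8, steps=200):
    """Gradient descent where each step adds alpha times the previous step."""
    previous = [0.0] * len(w)
    for _ in range(steps):
        g = gradient(w)
        # step(t) = -epsilon * grad E + alpha * step(t - 1)
        step = [-epsilon * gi + alpha * pi for gi, pi in zip(g, previous)]
        w = [wi + si for wi, si in zip(w, step)]
        previous = step
    return w


w_final = train_with_momentum([1.0, 1.0])
```

On this error surface the momentum term speeds up progress along the flat direction (w0) while the small learning rate keeps the steep direction (w1) stable.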
A better method is the line search method. The adjustment of $w(t)$ at step t becomes: 

$$\Delta w(t) = -\mu \nabla E$$

with $\mu$ selected such that the MSE at the new value of the weights becomes minimal, i.e. 

$$E[w(t+1)] = E[w(t) - \mu \nabla E] \text{ is minimal}$$

A simple way to find the value of $\mu$ is to increase $\mu$ in fixed steps $\delta$ until E(w) no 
longer decreases. Although the line search method may take considerably fewer steps 
than the gradient descent method, we must bear in mind that each step may take 
many evaluations of the error function E. However, it turns out that the calculation 
of the error E can be simplified, though we will not deal with this subject. 
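
The simple search described above (raise the step size in fixed increments until the error stops decreasing) can be sketched as follows. The quadratic error function and all names are assumptions chosen for illustration only.

```python
def error(w):
    """Illustrative error function E(w) = 0.5 * (w0**2 + 10 * w1**2)."""
    return 0.5 * (w[0] ** 2 + 10.0 * w[1] ** 2)


def gradient(w):
    return [w[0], 10.0 * w[1]]


def line_search_step(w, delta=0.01, max_trials=10000):
    """Move along -grad E, increasing mu in steps of delta until E stops decreasing."""
    g = gradient(w)
    mu = 0.0
    best = error(w)
    for _ in range(max_trials):
        trial = [wi - (mu + delta) * gi for wi, gi in zip(w, g)]
        e = error(trial)
        if e >= best:
            break
        mu += delta
        best = e
    return [wi - mu * gi for wi, gi in zip(w, g)], mu


w = [1.0, 1.0]
errors = [error(w)]
for _ in range(20):
    w, mu = line_search_step(w)
    errors.append(error(w))
```

Each call evaluates the error function many times, which is the cost mentioned in the text; in exchange, the error never increases from one step to the next.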

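Note that at the subsequent steps the direction vectors $-\nabla E[w(t+1)]$ and 
$-\nabla E[w(t)]$ are perpendicular, because for the optimal value of $\mu$ we have: 

$$0 = \frac{d}{d\mu} E[w(t) - \mu \nabla E[w(t)]] = -\nabla E[w(t)] \cdot \nabla E[w(t) - \mu \nabla E[w(t)]] = -\nabla E[w(t)] \cdot \nabla E[w(t+1)]$$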


The approach to the minimum is therefore a zig-zag path. A still better strategy is 
to let the new search direction be a compromise between the gradient direction and 
the previous search direction $d(t-1)$. If we write: 

$$w(t+1) = w(t) + \mu d(t)$$

then the search direction $d(t)$ becomes: 

$$d(t) = -\nabla E[w(t)] + \beta d(t-1)$$

This method is called the conjugate gradient method. 
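
As an illustration, the following sketch applies this scheme to a quadratic error function, for which the optimal step size along a direction has a closed form. The text does not say how $\beta$ is chosen; the Fletcher-Reeves formula used here is one standard choice and is an assumption, as are all function names.

```python
def matvec(a, v):
    return [sum(aij * vj for aij, vj in zip(row, v)) for row in a]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def conjugate_gradient(a, b, w, steps):
    """Minimize E(w) = 0.5 * w.A.w - b.w for a symmetric positive definite A."""
    g = [ri - bi for ri, bi in zip(matvec(a, w), b)]  # gradient of E at w
    d = [-gi for gi in g]                             # first direction: steepest descent
    for _ in range(steps):
        gg = dot(g, g)
        if gg == 0.0:
            break
        ad = matvec(a, d)
        mu = -dot(g, d) / dot(d, ad)                  # exact line search for a quadratic
        w = [wi + mu * di for wi, di in zip(w, d)]
        g_new = [gi + mu * adi for gi, adi in zip(g, ad)]
        beta = dot(g_new, g_new) / gg                 # Fletcher-Reeves choice (assumption)
        d = [-gi + beta * di for gi, di in zip(g_new, d)]
        g = g_new
    return w


A = [[1.0, 0.0], [0.0, 10.0]]
b = [0.0, 0.0]
w_min = conjugate_gradient(A, b, [1.0, 1.0], steps=2)
```

For a quadratic error function in n dimensions this procedure reaches the minimum in at most n steps (up to rounding), here two, whereas steepest descent with line search zig-zags towards it.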


3.21 Initialization of weights and scaling the input and output 


If we use the sigmoid transfer function for all neurons in the neural net, then the 
output values will vary between zero and one. Thus if our training and test sets 
contain target values beyond these boundaries, we have to rescale the target values 
$t(x_i)$. 

In the case of a linear scaling, the scaled values become: 

$$t^*(x_i) = \frac{t(x_i) - t_{\min}}{t_{\max} - t_{\min}}$$

with $t_{\min} = \min_i \{t(x_i)\}$ and $t_{\max} = \max_i \{t(x_i)\}$. If we have a linear output neuron, no 
scaling of the targets is required. 

Figure 3.80 Example of the initial distribution of six initial non-extended 
weight vectors over the two-dimensional weight space 

From a theoretical point of view the scaling of the input vectors is not necessary, 
because the input of a neuron is not required to be bounded. However, large input 
values together with large weights may result in large values of the weighted input 
$s_j(x_i) = \sum_k w_{jk} x_{ik}$, and the adaptation of weights may become almost zero because the 
derivative $df/ds_j$ occurring in the adaptation rule is almost zero for large values of 
the weighted input $s_j(x_i)$. Therefore the scaling of inputs depends on the value of the 
weights. 
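
How quickly the derivative vanishes can be checked numerically. The sketch below assumes the sigmoid is the logistic function f(s) = 1/(1 + exp(-s)), whose derivative is f(s)(1 - f(s)); the function names are illustrative.

```python
import math


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def sigmoid_derivative(s):
    """df/ds for the sigmoid f(s) = 1 / (1 + exp(-s))."""
    f = sigmoid(s)
    return f * (1.0 - f)


derivatives = {s: sigmoid_derivative(s) for s in (0.0, 2.0, 5.0, 10.0)}
```

The derivative is largest at s = 0 (where it equals 0.25), has dropped below 0.01 at s = 5 and is below 0.0001 at s = 10, so a neuron with a large weighted input barely adapts.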

The initial weights may randomly be selected from any interval of real values. 
However, with a random selection of weights we may end up in a local minimum of 
the error function E, and we may then have to repeat the learning process many 
times with different initializations in order to determine whether the final solution is 
a local minimum or not. Even with random initializations it is possible that different 
initializations are almost the same, or the weights of different neurons in the same 
layer may be almost the same. 

It is more profitable to distribute the initial weight vectors in some layer of neurons 
equally spaced over the weight space and to guarantee that subsequent initializations 
are different. 

For instance, if we have a two-dimensional input with six neurons in the first layer, 
we can for a first initialization distribute the six (non-extended) weight vectors $w_j$ 
over the two-dimensional weight space as illustrated in Figure 3.80. For a second 
initialization we can turn the set of vectors through an angle of $\pi/6$, etc. There remains 
the selection of the threshold weight $w_{j0}$ for each neuron j. The extended weight 
vector $\hat{w}_j = [w_{j0}, w_{j1}, w_{j2}]$ determines the separating hyperplane $\hat{w}_j^T \hat{x} = 0$ realized by 
neuron j. To ensure that the inputs contribute to the output of a neuron, the weighted 
input must not be far from the separating hyperplane. This can be done by selecting 
the weight $w_{j0}$ such that the separating hyperplane goes through the centre of gravity 
of the input data: 

$$w_{j0} + w_{j1} x_{1c} + w_{j2} x_{2c} = 0$$

with $x_{1c} = \sum_i x_{i1}/N$ and $x_{2c} = \sum_i x_{i2}/N$, where N is the number of examples in the training 
set. 
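
The two-dimensional initialization can be sketched as follows. The equal angular spacing on a circle of radius `length` is an assumption about what 'equally spaced' means here, and the function name and the sample data are purely illustrative.

```python
import math


def initial_weights(inputs, neurons=6, rotation=0.0, length=1.0):
    """Return extended weight vectors [w_j0, w_j1, w_j2] for two-dimensional inputs.

    The non-extended vectors are spaced at equal angles on a circle of radius
    `length` (an assumption), and each threshold w_j0 puts the separating
    hyperplane through the centre of gravity of the inputs.
    """
    n = len(inputs)
    c1 = sum(x[0] for x in inputs) / n
    c2 = sum(x[1] for x in inputs) / n
    weights = []
    for j in range(neurons):
        angle = rotation + 2.0 * math.pi * j / neurons
        w1 = length * math.cos(angle)
        w2 = length * math.sin(angle)
        w0 = -(w1 * c1 + w2 * c2)  # w_j0 + w_j1 * x_1c + w_j2 * x_2c = 0
        weights.append([w0, w1, w2])
    return weights


data = [(0.0, 0.0), (2.0, 0.0), (2.0, 4.0), (0.0, 4.0)]
first = initial_weights(data)
second = initial_weights(data, rotation=math.pi / 6)
```

The second call gives the rotated set of vectors for a subsequent initialization; in both cases every hyperplane passes through the centre of gravity (1, 2) of the sample data.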

This method for selecting the initial weights is not straightforward for the 
n-dimensional input. We invite the reader to develop a simple strategy based on the 
same method for the n-dimensional case. 

Another, more brutal method for initialization is to give every weight (except $w_{j0}$) 
a value of 1, 0 or -1. Thus the initial influence of an input $x_{ij}$ is positive, zero or 
negative. In the case of an n-dimensional input we generate the set $W_n$ of all different 
n-dimensional vectors with components 1, 0 or -1: $W_n = \{1, 0, -1\}^n$. If we have k 
neurons in some layer, we select from $W_n$ the collection $C_k$ of all subsets containing k 
different vectors. For subsequent initializations we select the different subsets from $C_k$. 
The value of the threshold weight $w_{j0}$ of neuron j is selected as before by requiring 
that the separating hyperplane goes through the centre of gravity of the inputs: 

$$w_{j0} + \sum_{m=1}^{n} w_{jm} x_{mc} = 0$$

with $x_{mc} = \sum_i x_{im}/N$ and N the number of examples $x_i$ in the training set. 
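
Enumerating the candidate initializations is straightforward with the standard library. The sketch below follows the definition literally (so the all-zero vector is included); the function names are hypothetical.

```python
from itertools import combinations, islice, product


def ternary_vectors(n):
    """All n-dimensional vectors with components 1, 0 or -1 (the set W_n)."""
    return list(product((1, 0, -1), repeat=n))


def initializations(n, k):
    """Lazily enumerate the subsets of W_n that contain k different vectors."""
    return combinations(ternary_vectors(n), k)


def threshold(weights, centre):
    """Threshold w_j0 that puts the hyperplane through the centre of gravity."""
    return -sum(w * c for w, c in zip(weights, centre))


first_three = list(islice(initializations(2, 3), 3))
```

The set $W_n$ has $3^n$ elements, so the number of subsets grows very quickly with n and k; enumerating them lazily avoids building the whole collection in memory.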

Given those sets of initial weights we have to scale the input vectors such that the 
absolute weighted input $|s_j(x_i)| < 5$ for every neuron j and for every input $x_i$, in order to 
guarantee that the initial derivatives df/ds will not become too small. 

If $x_{\max} = \max \{|x_{ij}|\}$ over all i and j then: 

$$\max_i \{|s_j(x_i)|\} \le \max_j \{|w_{j0} \pm n x_{\max}|\}$$

Let $w_0^*$ be the value of $w_{j0}$ for which $|w_{j0} \pm n x_{\max}|$ is maximal; then every input vector 
will be rescaled to: 

$$x_i^* = \frac{5 - w_0^*}{n x_{\max}} x_i \quad \text{if } \max_j \{w_{j0}\} + n \max_{i,j} \{x_{ij}\} > 5$$

$$x_i^* = \frac{5 + w_0^*}{n x_{\max}} x_i \quad \text{if } \min_j \{w_{j0}\} + n \min_{i,j} \{x_{ij}\} < -5$$


3.22 Exercises 


1. Determine the weights for an optimal classification with one single neuron with 
sigmoid transfer function for the ‘mushroom classification problem’ as illustrated 
in Figure 3.4. We want the output for the healthy mushroom to be equal to 1 
and for the other class equal to 0. 

2. Explain why the adaptation of the extended weight vector $\hat{w}$ containing all the 
weights of a neural net must be equal to $\Delta \hat{w} = -\varepsilon \nabla E$ for a sufficiently small value 
of $\varepsilon$. 

3. Check whether or not local learning is justified by our theory concerning the 
adaptation of weights. 

4. Determine the adaptation of weights for a single-neuron Perceptron with sigmoid 
transfer function if the training set contains the elements [1, 1] with target 0.8, 
and [1, -1] with target 0.2. The initial distribution of weights is $w_0 = 1$, $w_1 = 0.5$. 
Use for your convenience Figures 3.25 and 3.26 for f(s) and df/ds. 

5. Calculate a set of weights for the problem of Exercise 4 such that the MSE is zero. 

6. Check whether or not a single neuron can realize a function gy(x) with MSE 
E = 0 for the following elements: [1, 1] with target 0.8, [1, -1] with target 0.2, 
[-1, 1] with target 0.2, and [0, 0] with target 0.2. 

7. Determine the weights of a single-neuron Perceptron with sigmoid transfer 
function such that the MSE is minimal for a training set with the elements [1, 1] 
with target 0.8, [1, -1] with target 0.2, [-1, 1] with target 0.2, and [0, 0] with 
target 0.22. 

8. With the hyperplane boundary classification with one-zero labelling it can happen 
that at a certain stage of training, the classification is correct for the training set 
but the weights must still be adapted. Explain why. What is the result of prolonged 
training where the classification is already correct? 

9. Determine the first adaptation of the weight vector for Example 3.9 if the initial 
weights are $w_0 = -3$, $w_1 = -1$ and $w_2 = 1$. Take $\varepsilon = 10$. 

10. Explain the two local minima with E = 1 in Figure 3.40 for |w| = 100. 

11. Given the data set {-1, -0.5, 0, 0.5, 1} with targets t(-1) = 0, t(-0.5) = 0.25, 
t(0) = 0.5, t(0.5) = 0.75 and t(1) = 1. Determine the weights of all neurons with 
sigmoid transfer function such that the MSE is almost zero if we use a neural 
net with five neurons in the first layer and one neuron in the second layer. 

12. Determine the weights of a neural network with four neurons in the first layer 
and one in the second layer, each with sigmoid transfer function such that the 
one-dimensional function gy(x) is zero for x < -1 and x > 1 and will increase 
almost linearly from gy(-1) = 0.73 to gy(1) = 0.87 in the interval [-1, 1]. 
THE SELF-ORGANIZING NEURAL NETWORK 


4.1 Introduction 


Among the different types of artificial neural networks, the self-organizing neural 
network discussed in this chapter resembles real biological neural networks more 
than the other types. This type of artificial neural network was first introduced by 
Kohonen (1982) as the ‘self-organizing feature map’. 

If it is true that the self-organizing neural network is a realistic, although very 
simplified, model of the human brain we can get some idea about how the brain 
might store pictures and how human beings could be able to recognize pictures. 

In the next section we will show how pictures can be stored and recognized with 
a self-organizing neural network. The behaviour of the self-organizing neural network 
can be replaced by some equivalent algorithm which is easier to implement and is 
almost always used in applications of the self-organizing neural network. We will 
call the equivalent algorithm the self-organizing neural net algorithm (Kohonen, 1988). 
Due to the artificial and sophisticated mathematical operations in the algorithm the 
resemblance with real biological neural networks is then lost. 

As an introduction to the self-organizing neural network of Kohonen, we will 
explain in the next section the structure and behaviour of that neural net by describing 
the application of the neural net to the storage of visual images in a neural network. 

Apart from the next section, we will deal with the equivalent self-organizing 
algorithm in the remaining sections of this chapter. 


4.2 Anthropomorphic pattern recognition with a 
self-organizing neural network 


If people were not able to perceive different sensory data as equivalent they would 
not be able to survive, e.g. children would not recognize their own mothers. The 
human ability to recognize, apparently without much effort, different pictures as 
equivalent, challenges scientists to copy the neurophysiological mechanisms involved 
in human pattern recognition. For this reason the processing of information by 
artificial neural networks has attracted a lot of interest from many scientists in recent years. 
In this section we will show how pattern recognition can be performed in an 
anthropomorphic way. The observation of a picture and the preprocessing of observed 
data incorporate to a certain extent neurophysiological phenomena. A self-organizing 
neural network is used to realize a retinotopic mapping from the ‘retina’ to the 
‘cortex’. The cortical representation of the observed pattern can then be used as a 
template for pattern recognition. 

Besides using an artificial neural network for pattern recognition we will take into 
account some additional anthropomorphic mechanisms of visual perception to obtain 
a more human-like way of processing pattern information. 

Important phenomena of human visual information processing are as follows: 


1. Visual acuity. 
2. Eye movement during visual perception. 
3. Retinotopy. 


By using these aspects of information processing by the human visual system we will 
demonstrate that we are able to recognize pictures in an elegant and straightforward 
artificial way if we adopt in addition the following neurobiological mechanisms: 


4. Adaptation of neurosynaptic efficiency. 
5. The self-organization of a neural network. 


1. The visual ability of humans to distinguish the components of a pattern depends 
on the angular distance between the components and the eye axis. The resolving power, 
or acuity, is defined as the reciprocal of the visual angle subtended by the smallest 
details that the eye can distinguish. The resolving power declines sharply outside the 
central fovea (see Figure 4.1): at five arc minutes from the centre the acuity is reduced 
by 50 per cent. This implies that the information obtained by observing a pattern at 
one fixed point is far from complete and only at the centre of observation is the 
information accurate; in peripheral areas only some diffuse and global structure of 
the pattern is perceived. 

Figure 4.1 The relative visual acuity of man 

Figure 4.2 The rectangular observation window 

Figure 4.3 The circular observation window 

The human method of extracting information from a pattern can be simulated by 
observing that pattern through a window composed of observation fields of different 
sizes. Each field covers a different area of the pattern, and in each field only the 
average illumination of the pattern can be observed (Veelenturf, 1970). The greater 
the distance between the centre of the window and the centre of a field, the 
greater the area of the pattern covered by that field and hence the lower the resolving 
power. By using windows of the form of Figures 4.2 or 4.3 we can approximate this 
human way of extracting information from a pattern. We will give a more exact 
description of the observation window below. 
In addition, our vision becomes sharper when the light of a pattern is brighter. 
Peripheral areas of the retina are more sensitive to light and less accurate, while 
central areas are less sensitive to light but more accurate. We can simulate this 
mechanism by weighting the observed illumination values of a window field by some 
appropriate factor. 

2. When we look at a pattern, our eyes jump spontaneously about three times per 
second over about seven minutes of arc. These sudden jumps are called saccades. 
One of the functions of the saccades is to allow the visual system to use the most 
specialized area of the retina, the fovea, to process detailed visual information at 
successive fixation points. In the intersaccadic intervals only small amplitude 
movements, the drift, persist. 

Further insight into the role of eye movement has been gained by investigation of 
stabilized images (e.g. Gerrits and Vendrik, 1972). The perception of well-stabilized 
retinal images disappears within a few seconds, showing the need for eye movement 
in normal vision to maintain visual perception. 

Another psycho-physiological result shows that the size of saccadic movements 
increases with the size of the observed pattern (Stassen, 1980). Moreover it seems 
that the movements of the eye occur in a systematic fashion related to the individual 
stimulus and to elements of the pattern that contain the most information (Baker 
and Loeb, 1973) (see Figure 4.4). Semantic evocative memory will also play an 
important role in the directed scanning of a picture (Piaget, 1969). 

The first phenomenon can be simulated artificially by successive random 
displacement of the observation window, mentioned above, across the pattern to be 
classified, followed by observation processing at each point. 





Figure 4.4 Saccadic eye movement 
Figure 4.5 The circle and triangle used in the experiments of von Senden 


The third phenomenon can be captured (without taking into account the semantics) 
to some extent, e.g. by moving the centre of the window at each step from one point 
of fixation to the centre of a highly illuminated field in that observation or to points 
with a great illumination contrast between neighbouring fields. 

3. There exists some evidence for a more or less isomorphic mapping from the 
retina to some part of the cortex which is called retinotopy. This does not necessarily 
mean a photographic representation of the observed environment, or a metrically 
faithful copy, but must be understood as some feature preserving imaging, which 
results in a representation of topologically relevant features of the pattern observed. 

The next experimental observation shows that the mapping from the retina to the 
cortex is not genetic but a result of acquired visual experiences; it also reveals that 
the mapping must preserve some metric properties. 

When people with normal vision are confronted with the patterns of Figure 4.5 
they can tell the difference spontaneously. When, however, the patterns are presented 
to adults who have been blind from birth and have then been given sight by an 
operation, these subjects are unable to detect immediately the difference between a 
triangle and a circle. After a few months some patients are able to recognize 
spontaneously the circle and the triangle separately without counting the corners of 
the triangle (von Senden, 1932). 

One might suppose that retinotopy requires the establishment of well-ordered 
connections between the retina and the visual cortex. It is true that neural fibers 
grow according to some genetic plan, approximately to those places in which they 
are later needed, but the plasticity of the structure of connections is insufficient to 
explain the mechanism of learning by experience. It is more likely that retinotopy is 
due to the modifiability of the efficiency of information transmission by synapses. 
Repeated use of a synapse in a neural circuit increases the synaptic efficiency. Thus 
when many optic fibers originating from many different parts of the retina are 
connected to some cell, the synaptic efficiency of all incoming fibers may be altered 
such that the pertinent cell becomes optimally sensitive to the illumination of some 
specific parts of the retina. In this way the physical structure of connections is not 
changed, but rather the functional structure of connections, though with the same 
result. 
4. As mentioned above, the behaviour of a neural net can be changed by modifying 
the synaptic efficiency of transmission of information from one cell to another. This 
phenomenon can be realized in an artificial neural net composed of artificial neurons. 
In our case of simulating visual information processing, each input of every neuron 
is connected to a different field of the observation window described above (replacing 
the retina). The output of each neuron is some monotone increasing function of all 
weighted input values. The impact of synaptic efficiency can be modelled by 
multiplying each value of observed illumination in a field by some real number, called 
the weight of that input. The value of a weight will be incremented if there exists a 
positive correlation between the value of the particular input and the response 
of the neuron to which the input fiber is connected. This way of modifying the weights 
is called the Hebb rule, after D. O. Hebb who was among the first to envisage the 
role of this learning mechanism in lasting synaptic changes (Hebb, 1949). 
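
A minimal sketch of such a Hebbian weight change is given below. The sigmoid output function, the learning rate `rate` and the function names are assumptions made for the illustration; the text only requires a monotone increasing output function and an increment that follows the correlation between input and response.

```python
import math


def neuron_output(weights, inputs):
    """Monotone increasing function (here a sigmoid, an assumption) of the weighted input."""
    s = sum(w * v for w, v in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-s))


def hebb_update(weights, inputs, rate=0.1):
    """Increase each weight in proportion to the product of its input and the response."""
    y = neuron_output(weights, inputs)
    return [w + rate * y * v for w, v in zip(weights, inputs)]


weights = [0.0, 0.0, 0.0]
pattern = [1.0, 0.0, 1.0]
before = neuron_output(weights, pattern)
for _ in range(10):
    weights = hebb_update(weights, pattern)
after = neuron_output(weights, pattern)
```

Repeated presentation of the same pattern strengthens only the weights of the active inputs, so the neuron responds more strongly to that pattern afterwards. In this plain form the weights can grow without bound, which is why practical variants add some form of normalization or decay.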

5. With the observation window we obtain isolated local information about the 
pattern at different (randomly selected) positions in the pattern. We have to realize, 
however, a cortical representation of the total pattern observed on the retina. This 
implies that neighbouring (similar) observations are mapped on neighbouring neurons, 
i.e. the neighbouring neurons will respond optimally to similar (neighbouring) 
observations. For pattern recognition this requirement is still insufficient because the 
relative position of pattern components with respect to other components in the 
pattern is crucial. Therefore the mapping must preserve the positional interrelationship 
between observations at different fixed points. Thus we have to mimic the topological 
feature preserving mapping from the retina to the cortex. In addition, the 
mapping must be learned by experience. 

Using the self-organizing artificial neural net proposed by Kohonen, and scanning 
the pattern with the window mentioned above, we can simulate the learning of the 
‘topological feature preserving mapping’. At each observation all the synaptic weights 
(represented by a weight vector w) of the ‘winning neuron’ (i.e. the neuron with the 
greatest response) and its neighbours will be adapted to the pertinent observation 
vector v such that the winning neuron and its neighbours become more sensitive to 
that observation v. After adaptation the weighted input $\sum_i w_i v_i$ of the winning neuron 
and its neighbours will increase for the observation v. The feature preserving 
self-organizing mapping is mainly caused by this process of selective adaptation of 
synaptic weights and the lateral excitatory or inhibitory connections between the 
neurons. We will describe this structure in more detail below. Essential for the 
observation with the window is that, if $v(x_i, y_i)$ is the vector of illumination values 
of the fields of the window, and the window is centered at position $(x_i, y_i)$, then the 
following relation holds: 

$$\text{if } d_V\{v(x_i, y_i), v(x_j, y_j)\} < d_V\{v(x_i, y_i), v(x_k, y_k)\}$$

$$\text{then } d_E\{(x_i, y_i), (x_j, y_j)\} < d_E\{(x_i, y_i), (x_k, y_k)\}$$

with $d_V$ some distance measure in the n-dimensional input space V of observation 
vectors v, and $d_E$ being the Euclidean distance in the two-dimensional Euclidean 
picture space. 
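
One adaptation step of this kind can be sketched as follows. The sketch uses the usual simplifications of the algorithmic form: the winner is the neuron whose weight vector is closest to the observation, the neighbourhood is a square block of lattice positions, and the update moves each selected weight vector a fraction `rate` towards v. These choices and all names are assumptions, not the exact procedure of the text.

```python
import random


def make_lattice(rows, cols, dim, seed=0):
    """Random initial weight vectors for a rows x cols lattice of neurons."""
    rng = random.Random(seed)
    return {(r, c): [rng.random() for _ in range(dim)]
            for r in range(rows) for c in range(cols)}


def squared_distance(w, v):
    return sum((wi - vi) ** 2 for wi, vi in zip(w, v))


def som_step(lattice, v, rate=0.5, radius=1):
    """Adapt the winning neuron and its lattice neighbours towards observation v."""
    winner = min(lattice, key=lambda pos: squared_distance(lattice[pos], v))
    for (r, c), w in lattice.items():
        if max(abs(r - winner[0]), abs(c - winner[1])) <= radius:
            lattice[(r, c)] = [wi + rate * (vi - wi) for wi, vi in zip(w, v)]
    return winner


lattice = make_lattice(4, 4, dim=3)
v = [1.0, 0.0, 0.5]
d_before = min(squared_distance(w, v) for w in lattice.values())
winner = som_step(lattice, v)
d_after = squared_distance(lattice[winner], v)
```

After the step the winner and its neighbours lie closer to the observation, so similar observations presented later tend to select the same region of the lattice.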
Figure 4.6 The capital ‘G’ used as a picture 


In subsequent sections we will give a mathematical justification of our claims, 
which will turn out to be mainly a form of vector quantization together with a 
‘neighbourhood’ ordering due to the lateral activation in the neural net. 

We will now give a more detailed description of the anthropomorphic pattern 
recognizer. 


Given some picture in a two-dimensional Euclidean object space O, we divide the 
plane in identical small squares: the pixels. The coordinates of a pixel are represented 
by a pair of integers (x, y). Each pixel in the plane has a pixel value p(x, y). When 
the picture covers a pixel we assign the pixel value 1, otherwise p(x, y) = 0. 
The picture P is defined by a set of pairs: $P = \{[(x, y), p(x, y)] \mid (x, y) \in O\}$ (see Figure 4.6). 
The picture is sampled at the position $e_i = (x_i, y_i)$ with a square window $W_h(e_i)$ (see 
Figure 4.7 with h = 3) which covers $3^h \times 3^h$ pixels, with its centre at position $e_i$ in the 
plane. The window has h resolution levels. Each level k of the window consists of 
eight square fields, except the zero level which consists of one field. The total number 
of fields is equal to 8h + 1. A field on level k covers a square of $3^{k-1} \times 3^{k-1}$ pixels of 
the plane. The field at level zero covers only the pixel at the central position $e_i$ 
of the window. 
Figure 4.7 represents what is observed through the window if the centre of the 
window is located in the second row and the ninth column of the picture of Figure 4.6. 
With the jth field on the level k, there corresponds a field value $f_{kj}$, defined in terms of 
the pixel values p(x, y) with (x, y) in the jth field on level k, and $w_k$ a constant for scaling 
the contribution of the illumination of a field at level k. 
Figure 4.7 The result of the observation of the capital ‘G’ with the center 
of the window in the second row and at column 9 (labels in the figure: Field 3 on level 3; Field 3 on level 2) 


From the set of field values we construct an observation vector v. For each field 
value $f_{kj}$ there exists a unique element $v_i$ in the observation vector v. The observation 
vectors obtained by sampling the picture with the window at random positions are 
the input vectors of a self-organizing neural network. 
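
The construction of an observation vector can be sketched as follows. The geometry follows the description of the window (one central pixel plus eight square fields of side $3^{k-1}$ on each level k), but the exact field value used here, a scaled average of the pixel values in the field with scale 1 on every level, is an assumption, as are the function names and the test picture.

```python
def field_value(picture, cx, cy, dx, dy, level, scale=1.0):
    """Scaled average pixel value of one field; pixels outside the picture count as 0.

    For level >= 1 the field is a square of side 3**(level - 1) whose centre is
    offset by (dx, dy) times that side from the window centre (cx, cy).
    """
    side = 1 if level == 0 else 3 ** (level - 1)
    half = side // 2
    x0 = cx + dx * side - half
    y0 = cy + dy * side - half
    total = 0
    for y in range(y0, y0 + side):
        for x in range(x0, x0 + side):
            if 0 <= y < len(picture) and 0 <= x < len(picture[0]):
                total += picture[y][x]
    return scale * total / (side * side)


def observation_vector(picture, cx, cy, h=3):
    """Return the 8h + 1 field values seen through a window centred at (cx, cy)."""
    vector = [field_value(picture, cx, cy, 0, 0, 0)]
    offsets = [(dx, dy) for dy in (-1, 0, 1) for dx in (-1, 0, 1) if (dx, dy) != (0, 0)]
    for level in range(1, h + 1):
        for dx, dy in offsets:
            vector.append(field_value(picture, cx, cy, dx, dy, level))
    return vector


# Test picture of 27 x 27 pixels: the right-hand part (x >= 13) is covered.
picture = [[1 if x >= 13 else 0 for x in range(27)] for y in range(27)]
v = observation_vector(picture, 13, 13)
```

With h = 3 the window spans 27 x 27 pixels and the vector has 25 elements: fine detail near the centre and only averaged illumination in the outer fields.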


Learning with the self-organizing neural network 


The neural network consists of N artificial neurons in a two-dimensional lattice (see 
Figure 4.8). Each neuron $u_r$ has one output line and the value of the output at time 
t will be denoted by $\eta_r(t)$. An observation vector v, obtained by sampling a picture 
with a window $W_h(e_i)$, is the input vector for each neuron in the neural net. 

Each neuron has the same set of 8h + 1 external input lines and the value of the 
jth external input at time t will be given by $v_j(t)$. (In Figure 4.8 the input lines are 
depicted for only one neuron.) The value $v_j(t)$ is the jth element of the observation 
vector v(t) mentioned above. Each external input value $v_j(t)$ of neuron $u_r$ is multiplied 
by a synaptic weight factor $w_{rj}(t)$. 

If there are N neurons in the neural net, then there are in addition N - 1 internal 
input lines for every neuron in the net, arriving from every other neuron in the neural 
net. (In Figure 4.8 the internal input lines are depicted for only one neuron.) 

The value $\eta_k(t)$ of neuron $u_k$, multiplied by a lateral synaptic weight factor $\gamma_{rk}$, 
constitutes the value of the kth internal input of neuron $u_r$. When d(r, k) is the distance 
between neuron $u_r$ and $u_k$, then the lateral weight factor $\gamma_{rk}$ will depend on this 
distance. The lateral synaptic weight function may have the form given in Figure 4.9. 
When $\gamma_{rk}$ is positive we say the lateral effect of neuron $u_k$ on neuron $u_r$ is excitatory, 
if $\gamma_{rk}$ is negative we say the effect is inhibitory. This kind of lateral feedback has been 
known for a long time in neuroanatomy (e.g. Edelman and Finkel, 1985). 

Figure 4.8 The observation window as an input for a two-dimensional neural 
network 

Figure 4.9 The lateral synaptic weight function 

The value of the output $\eta_r(t + \Delta)$ of neuron $u_r$ at time $t + \Delta$ is given by some 
non-linear monotone increasing function of the total weighted input at time 
Figure 4.45 Classification boundary for the nearest-neighbour method 


In Figure 4.49 one can see what is happening to the classification boundary if we 
use two neural networks with 4 x 8 neurons. The number of neurons in each net is 
larger than the number of samples in the data sets $D_A$ and $D_B$. Note that almost 
every example of the data sets is represented by some weight vector in the neural 
nets. The redundant weights are automatically placed by interpolation between 
examples of the data sets. 

If besides the examples of the data sets $D_A \subset X_A$ and $D_B \subset X_B$, one has some 
additional information about the classes $X_A$ and $X_B$, then the nearest-neighbour 
classification might not be optimal. Suppose, for instance, one knows (or assumes) 
that there are some class conditional probability density functions f(v|A) and f(v|B), 
then, in the case of equal cost of misclassification and equal class probabilities p(A) 
and p(B), one must assign an input v to class $X_A$ if f(v|A) > f(v|B) and to class $X_B$ 
if f(v|B) > f(v|A). This will not, however, be the case if we use the nearest-neighbour 
method. If one assumes the existence of overlapping probability distributions by 
which the examples are generated, it is better to use the Bayes method that we will 
discuss in the next section. 
Figure 4.46 Result of vector quantization for class $X_A$ with a 2x2 neural 
Kohonen network A 


Example 4.7 


We consider a one-dimensional case. We have two finite data sets $D_A$ and $D_B$. The 
elements of $D_A$ are generated according to a Gaussian probability density function 
with a mean of $\mu_A = 0.00$ and a standard deviation of $\sigma_A = 1.00$. The other set is 
generated by a Gaussian density with $\mu_B = 2.00$ and standard deviation $\sigma_B = 1.50$ 
(see Figure 4.50). The optimal boundary for classification is $t = 1.09$. The other 
boundary is $t = -4.29$ (not given in Figure 4.50). 
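
The two boundaries quoted above can be checked numerically. The sketch below assumes equal class probabilities and equal misclassification costs, as in the text, and solves $f(v|A) = f(v|B)$ for two Gaussian densities by taking logarithms, which gives a quadratic in $v$. The function name `gaussian_boundaries` is ours and is introduced only for illustration.

```python
import math

def gaussian_boundaries(mu_a, sigma_a, mu_b, sigma_b):
    """Return the sorted points v where two Gaussian densities are equal."""
    # log f(v|A) - log f(v|B) = 0 rearranges to a*v^2 + b*v + c = 0.
    a = 1.0 / (2 * sigma_b**2) - 1.0 / (2 * sigma_a**2)
    b = mu_a / sigma_a**2 - mu_b / sigma_b**2
    c = (mu_b**2 / (2 * sigma_b**2) - mu_a**2 / (2 * sigma_a**2)
         + math.log(sigma_b / sigma_a))
    if a == 0.0:
        # Equal deviations: the quadratic term vanishes, one boundary remains.
        return [-c / b] if b != 0.0 else []
    disc = b * b - 4 * a * c
    if disc < 0:
        return []
    root = math.sqrt(disc)
    return sorted([(-b - root) / (2 * a), (-b + root) / (2 * a)])

# Parameters of Example 4.7.
boundaries = gaussian_boundaries(0.0, 1.0, 2.0, 1.5)
print([round(t, 2) for t in boundaries])
```

With the parameters of Example 4.7 the two roots agree with the boundaries given in the text to two decimals.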

We train a one-dimensional self-organizing neural network with five neurons on 
the data set $D_A$. The values of the final weights of this network A are 
given in the first row at the bottom of Figure 4.50. We do the same for the data set 
$D_B$. The final weights of network B are given in the second row. If we now apply the 
nearest-neighbour method, all inputs in the receptive field of the leftmost weight of 
neural net B are wrongly classified as elements of class $X_B$ (see the last row of 
Figure 4.50). In the same way all inputs in the receptive field of the rightmost weight 
of the neural net A are wrongly classified as members of class $X_A$. 
Figure 4.47 Result of vector quantization for class $X_B$ with a 2 x 2 neural 
Kohonen network B 


4.9 The Bayes classification with a self-organizing neural net 
algorithm 


If one knows or assumes that the examples of data sets of different classes are generated 
according to some underlying probability distribution functions, then the best thing 
to do is to estimate the parameters of those distributions from the given data sets, 
and use this information to determine a threshold for classifying new data. 

In case of a two-class classification problem (with equal class probabilities $p(A)$ 
and $p(B)$ and equal costs for misclassification) one assigns an input $v_i$ to class $X_A$ if 
the class conditional density function $f(v_i|A)$ is larger than the class conditional 
density function $f(v_i|B)$. This method of classifying inputs can be realized by a 
self-organizing neural net without separately estimating the parameters of the 
underlying probability distributions, because the neural net will do the job for us. 

First we have to make a slight modification of the algorithmic adaptation rule. We 
append both the input vectors $v_i$ (hereafter called the master input vectors) and the 
weight vectors $w_r$ (hereafter called the master weight vectors) with a so-called slave 
vector $\tilde{v}_i$ (respectively $\tilde{w}_r$) with a number of components equal to the number of 
classes. The newly appended input vector will be denoted by $\hat{v}_i = [v_i, \tilde{v}_i]$. The appended 
weight vector will be denoted by $\hat{w}_r = [w_r, \tilde{w}_r]$. 

Figure 4.48 Classification boundary with the nearest-neighbour method for 
the representations in the neural networks A and B 

If an input vector is an element of the data set of the kth class, then we make the 
kth element of the slave input vector equal to 1; if it does not belong to the kth class, 
then we make that component of the slave input vector equal to 0. The components 
of the initial master weight vector and of the initial slave weight vector are randomly 
chosen between 0 and 1. The algorithmic adaptation rule is then as follows. 

Given some master input vector $v(t) = v_i$ with $\hat{v}(t) = [v_i, \tilde{v}_i]$. 


1. Determine the winning neuron $u_s$ for the master vector $v(t)$, i.e. 

$$d_V[w_s(t), v(t)] = \min_r d_V[w_r(t), v(t)]$$

2. Every weight vector $\hat{w}_r(t) = [w_r, \tilde{w}_r]$ in the net will be changed to: 

$$\hat{w}_r(t+1) = \hat{w}_r(t) + g(r, s, t)\,[\hat{v}(t) - \hat{w}_r(t)]$$

with $g(r, s, t)$ a scalar-valued function with a value between 0 and 1 depending on 
time and on the distance in the neural net between the winning neuron $u_s$ and the 
neuron $u_r$ to be adapted. 


Figure 4.49 Classification boundary with the nearest-neighbour method for 
the representations in two 4 x 8 neural networks A and B 
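
The two steps of the adaptation rule can be sketched as follows. This is a minimal illustration for a one-dimensional chain of neurons, not the authors' implementation: the function name `master_slave_step`, the data layout (each weight stored as a `[master_part, slave_part]` pair) and the form of `g` are assumptions. The time dependence of $g(r, s, t)$ is left to the caller, who passes a function `g(r, s)` valid for the current step.

```python
import math

def master_slave_step(weights, master, slave, g):
    """Apply one adaptation step in place and return the winner index s.

    weights: list of [master_part, slave_part], each part a list of floats
    master, slave: the two parts of the appended input vector
    g(r, s): adaptation strength in [0, 1] for neuron r when s has won
    """
    # Step 1: the winner is determined from the master parts only.
    s = min(range(len(weights)),
            key=lambda r: math.dist(weights[r][0], master))
    # Step 2: both parts of every weight vector move towards the input.
    for r, (w_master, w_slave) in enumerate(weights):
        rate = g(r, s)
        weights[r][0] = [w + rate * (v - w) for w, v in zip(w_master, master)]
        weights[r][1] = [w + rate * (v - w) for w, v in zip(w_slave, slave)]
    return s

# Hypothetical two-neuron net; only the winner is adapted (g = 0.5 for r == s).
net = [[[0.0], [0.5]], [[1.0], [0.5]]]
winner = master_slave_step(net, [0.9], [1.0],
                           lambda r, s: 0.5 if r == s else 0.0)
print(winner, net)
```

The slave part never influences which neuron wins; it is only dragged along with the master part, which is the property the rest of this section relies on.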


To explain the final result and what is happening during the learning phase we 
confine ourselves to the two-category classification problem. In this case we can use 
a one-dimensional slave vector. The one-dimensional slave vectors for master vectors 
of $D_A$ will be given the value 0 and the one-dimensional slave vectors for master 
vectors of $D_B$ will be given the value 1. 

Due to the vector quantization property of the algorithm the final weight vectors 
on the one hand will become similar to the appended input vectors $\hat{v}_i$ corresponding to the master 
vectors of the data set $D_A \subseteq X_A$, and on the other hand similar to the appended input vectors $\hat{v}_j$ 
corresponding to the examples $v_j$ of the data set $D_B \subseteq X_B$. In a region of the input 
space (= weight space) where there are only elements of $D_A$, the slave elements of the 
weight vectors will be permanently adapted to a value 0, and in a region dominated 
by elements of $D_B$ the slave elements of the weight vectors will be mainly adapted 
to a value 1. In a region of the input space where the unknown class conditional 
probability density functions are the same, $f(v|A) = f(v|B)$, the number of elements of 
$D_A$ and $D_B$ will be almost the same. In that region the slave elements of the weight 
vectors will be adapted as many times to value 1 as to value 0, thus in those regions 
the slave elements of weight vectors will become equal to 0.5. 

Figure 4.50 Top diagram: two Gaussian-distributed, one-dimensional classes. 
First row: weights obtained after vector quantization with a 
one-dimensional net of five neurons for data with the left 
distribution. Second row: weights obtained after vector 
quantization with a one-dimensional net of five neurons for data 
with the right distribution. Third row: classification result with 
nearest-neighbour method 

The final result will be that a master input vector $v_i \in X_A$ will belong to the receptive 
field $R(w_r)$ of a master weight vector $w_r$ with a slave element with a value smaller 
than 0.5, and if $v_i \in X_B$, then $v_i$ belongs to the receptive field of a weight vector with 
a slave element value larger than 0.5. 

After the learning phase the neural net can be used as a classifier: given some input 
vector $v$, one can determine the ‘winning neuron’; the slave value of the 
corresponding weight vector then indicates whether (< 0.5) or not (> 0.5) the vector belongs 
to $X_A$. 
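
The classification step can be sketched in a few lines for the one-dimensional, two-class case. The weight values below are hypothetical and chosen only for illustration; the function name `classify` is ours.

```python
def classify(weights, v):
    """Classify a scalar input v with a trained master-slave net.

    weights: list of (master_weight, slave_value) pairs
    """
    # The winning neuron is the one whose master weight is closest to v.
    _, slave = min(weights, key=lambda w: abs(w[0] - v))
    # Slave value below 0.5 indicates class A; a value of exactly 0.5 is
    # assigned to B here, which is an arbitrary tie-breaking choice.
    return "A" if slave < 0.5 else "B"

# Hypothetical trained net with three neurons.
trained = [(-1.0, 0.1), (0.5, 0.4), (2.0, 0.8)]
print(classify(trained, 0.0), classify(trained, 1.9))
```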

The weight vectors with a slave value of approximately 0.5 will be located near the optimal 
discrimination curve. We applied the method to the two-dimensional classification 
problem discussed in Example 3.8 with a two-dimensional neural net with 10 x 10 
neurons. We found a classification error of 5.45 per cent. 

Figure 4.51 Top diagram: two Gaussian-distributed, one-dimensional classes. 
First row: the value of the slave element of the weight vectors 
obtained after vector quantization with a one-dimensional net 
of ten neurons for data of both distributions. Third row: 
classification result with the Bayes method 


Example 4.8 


We consider a two-category classification problem with a one-dimensional input space. 
The data set $D_A$ is generated by a Gaussian distribution function with mean 
$\mu_A = 0.00$ and deviation $\sigma_A = 1.00$. The data set $D_B$ is generated by a Gaussian 
distribution with $\mu_B = 1.00$ and $\sigma_B = 2.00$. A histogram for the frequency of the elements 
of $D_A \cup D_B$ in the training set, together with the class conditional density functions, 
is given in Figure 4.51. 

We used a one-dimensional neural network with ten neurons. In the first 
row below the histogram in Figure 4.51 we have plotted the value of the slave element 
of the ten weight vectors. We observe that for input elements $v$ with $f(v|B) > f(v|A)$ 
the slave value is larger than 0.5 and thus will be classified correctly. The final row 
in Figure 4.51 gives the values of the master weights of the ten neurons after learning. 


Figure 4.52 A three-class, two-dimensional data classification problem 


Example 4.9 


We consider a three-category two-dimensional classification problem. The three 
classes were generated with the following Gaussian probability distributions (see 
Figure 4.52): 


class A with $\mu_A = (-2, 2)$ and $\sigma_A = (2, 2)$ 
class B with $\mu_B = (1, 1)$ and $\sigma_B = (2, 2)$ 
class C with $\mu_C = (0, -1)$ and $\sigma_C = (1, 1)$ 


The neural network was two-dimensional with 10 x 10 neurons. The input vectors 
of the data set were extended with a slave vector with three components. The first 
component is equal to 1 if the input vector is an element of class A; if not, then the 
value will be zero. The second component is only equal to 1 for input vectors from 
class B and the third component is only 1 for elements of class C. The weight vectors 
of the neurons were five-dimensional with the first two representing the master vector 
and the remaining three the slave vector. All weight values were randomly initialized. 
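
The construction of the five-dimensional vectors, and the way a neuron is labelled from its slave components, can be sketched as follows. The helper names `append_slave` and `neuron_label` and the numeric values are hypothetical and serve only to illustrate the encoding described above.

```python
CLASSES = ["A", "B", "C"]

def append_slave(master, label):
    """Extend a master input vector with a one-of-three slave vector."""
    slave = [1.0 if c == label else 0.0 for c in CLASSES]
    return list(master) + slave

def neuron_label(weight, n_master=2):
    """Label a neuron by its largest slave component."""
    slave = weight[n_master:]
    best = max(range(len(slave)), key=slave.__getitem__)
    return CLASSES[best]

print(append_slave([0.3, -1.2], "B"))            # master part, then slave part
print(neuron_label([0.1, 0.2, 0.2, 0.3, 0.5]))   # third slave component is largest
```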

The results after training with 1000 examples (1000 learning steps) are represented 
in Figure 4.53. The x, y coordinates of a symbol (A, B or C) give the values of the 
first and second weight components. These master weight vectors represent the 
quantized data set. If a symbol is equal to ‘A’, then the first component of the pertinent 
slave weight vector has the largest value. The same holds for the symbols ‘B’ and ‘C’ 
with respect to the second and third slave components. The four bold face symbols 
in Figure 4.53 are incorrect. The three rightmost vectors with symbol A have to be 
classified as elements of class B. The bold face symbol C must be an A. A larger 
neural net would improve the results. 

Figure 4.53 The weights after learning with the data of Figure 4.52 in a 
two-dimensional 10 x 10 neural net. A weight is labelled as ‘A’ 
if the first slave element is larger than the other two slave 
elements. Classes ‘B’ and ‘C’ are labelled similarly 


4.10 Application of the self-organizing neural net algorithm to 
the classification of handwritten digits 


If we sample different pictures of some class of pictures (in our case handwritten 
representations of some digit, see Figure 4.54) with the window introduced in 
Section 4.2, and present in a learning phase the observation vectors v obtained by 
that window to a self-organizing net, then the weight vectors will become similar to 
those observations that are common in all pictures in that class. In this way the 
topological features of a class of pictures will be stored in the weight vectors of the 
neural net. 

We performed a classification experiment for handwritten digits. Figure 4.55 shows 
some examples of handwritten digits. We used the nearest-neighbour method discussed 
in Section 4.8. 

Figure 4.54 A handwritten ‘3’ observed with the window 

Figure 4.55 Examples of handwritten digits 


We used ten two-dimensional self-organizing neural networks of 7 x 7 neurons, 
one network for each class of handwritten digits: ‘0’, ‘1’, ..., ‘9’. Each handwritten 
digit was presented in a rectangle of 30 x 40 pixels. The centre of the observation window 
can be placed at 30 x 40 different locations in a picture, giving 1200 different 
observation vectors $v_i$ for each example of a digit. Each net was trained with 
10 x 1200 observations from ten different handwritten examples of one type of digit. 
In the learning phase each observation vector was presented twice to the neural 
network. If a network was trained with examples of digit $i$ we denote that network 
by $N_i$. After learning, fifteen new handwritten examples of each class were used as a 
test set. For each example 1200 observations were presented to the ten neural networks. 
An observation vector $v$ obtained by sampling an example was assigned to a net $N_i$ if 
$|v - w_i| = \min_j |v - w_j|$, with $w_i$ and $w_j$ the weight vectors of the winning neurons in neural 
nets $N_i$ and $N_j$ respectively. If the majority of the 1200 observation vectors of one example 
was assigned to neural network $N_k$, then the example was classified as the digit $k$. 
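
The voting procedure can be sketched as below. The sketch assumes each net is given simply as a list of its weight vectors and uses the Euclidean distance; the function name `classify_example` and the toy data are ours. Note that the code returns the net with the most votes (a plurality), which coincides with the majority rule of the text whenever one net collects more than half of the observations.

```python
import math
from collections import Counter

def classify_example(observations, nets):
    """Classify one example from its observation vectors.

    nets: dict mapping a class label to the list of weight vectors of its net
    """
    votes = Counter()
    for v in observations:
        # For each net take the distance to its winning neuron, then
        # assign the observation to the net with the smallest such distance.
        best = min(nets, key=lambda k: min(math.dist(v, w) for w in nets[k]))
        votes[best] += 1
    return votes.most_common(1)[0][0]

# Hypothetical toy nets with one neuron each and three observations.
toy_nets = {0: [[0.0, 0.0]], 1: [[1.0, 1.0]]}
toy_obs = [[0.1, 0.0], [0.9, 1.0], [0.8, 0.9]]
print(classify_example(toy_obs, toy_nets))
```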

In Figure 4.56 we have given an outline of the classification procedure and the 
result of the classification for a handwritten representation of the digit ‘6’. (Note that 
we can use the typical distribution of the allocations of observation vectors of some 
digit to the different nets as a criterion for classification.) 
Figure 4.56 Outline of the classification procedure 


From the 150 = 10 x 15 examples used in the test set, six examples were wrongly 
classified, i.e. a score of 96 per cent. To train the ten neural networks took 9 min on 
an HP 9000. The classification of one example in the test phase took 2 sec. 

The same classification score was obtained when we only used twenty randomly 
selected observation vectors out of the 1200 possible observation vectors of some 
test digit. In this way we obtained a classification time of about 0.1 sec. 

We see that if a class of pictures has some topological features in common, we 
can use the neural network for pattern recognition in a straightforward way. 


4.11 Topology preservation with a self-organizing algorithm 


In Section 4.5 we found that by using the adaptation rule: 

$$w_r(t+1) = w_r(t) + \varepsilon(t)\,h(r, s, t)\,[v(t) - w_r(t)]$$

we are minimizing the error function: 

$$E(W) = \frac{1}{2} \sum_{w_r} \sum_{w_s} \sum_{v \in R(w_s)} \varepsilon(t)\,p(v)\,h(r, s, t)\,|v - w_r|^2$$

We could distinguish two different learning phases: 

1. The quantization phase (final phase). In that phase we have for the neighbourhood 
function $h(r, s, t) = 0$ for $r \neq s$ and $h(r, s, t) = 1$ for $r = s$, and we are minimizing: 

$$E_Q(W) = \frac{1}{2} \sum_{w_s} \sum_{v \in R(w_s)} p(v)\,|v - w_s|^2$$

The quantization phase is characterized by the property of vector quantization: an 
input data space of M d-dimensional vectors will be replaced by a smaller 
‘representative’ set of N d-dimensional weight vectors of the neural net. 

2. The ordering phase (initial phase). We are minimizing: 

$$E_O(W, t) = \frac{1}{2} \sum_{w_r} \sum_{w_s} p(w_s)\,g(r, s, t)\,|w_s - w_r|^2$$

The ordering phase is characterized by the property that the neural net will be 
ordered: neighbouring neurons in the network will obtain similar weights. The net 
is well-ordered if neighbouring neurons have adjacent receptive fields. 

The approximation of the ordering error $E_O(W, t)$ by: 

$$E_O(W) = \sum_i \sum_j c_{ij}\,|w_i - w_j|^2, \qquad \delta_{ij} = d_L(i_i, i_j)$$

reveals more directly the property that we are reducing the sum of weighted mutual 
distances between all weight vectors in the ordering phase. The weight factor $c_{ij}$ is 
large for ‘close neighbours’ ($\delta_{ij} = 1$) of neurons in the neural net and will be small 
for ‘distant neighbours’ ($\delta_{ij} > 1$). 
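
For a finite training set the quantization error of the final phase can be estimated directly. The sketch below makes one explicit assumption that is not in the text: the density $p(v)$ is replaced by the empirical frequency $1/M$ of each of the $M$ samples. The function name `quantization_error` is ours.

```python
import math

def quantization_error(samples, weights):
    """Empirical version of the quantization error.

    Each sample contributes the squared distance to its winning weight
    vector, i.e. to the weight vector whose receptive field contains it.
    """
    total = 0.0
    for v in samples:
        total += min(math.dist(v, w) for w in weights) ** 2
    return 0.5 * total / len(samples)

# Three one-dimensional samples represented by two weight vectors.
print(quantization_error([[0.0], [1.0], [4.0]], [[0.0], [3.0]]))
```

In the toy call the squared distances to the winners are 0, 1 and 1, so the estimate is 0.5 x 2 / 3.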


The properties mentioned above deal mainly with the mapping of input vectors 
to weight vectors. Besides this quantization mapping $\varphi$ from the space of input vectors 
$V$ to the set of weight vectors $W$, there is a mapping $\psi$ (the projection mapping) 
from $W$ to the lattice $L$ of neurons, because with each weight vector there is associated a 
neuron in the neural net $L$. So we obtain a so-called feature mapping $\Psi = \varphi \cdot \psi$ from 
$V$ to the lattice $L$. In general the input vectors are obtained by observations 
(measurements) of some object space $O$ (e.g. pictures or signals observed by some 
window). The observation mapping will be represented by the symbol $\alpha$ (see Figure 4.57). 
Frequently one is only interested in the representation of the input space $V$ by 
the weight vectors of $W$, and one disregards the position of the neurons associated 
with the weight vectors. 

Figure 4.57 The set of interrelated mappings 

In more sophisticated applications of the self-organizing neural net algorithm, the 
feature mapping $\Psi$ from the vector space $V$ to the lattice $L$ of neurons is used. We 
will say that an input vector $v_i$ is represented by neuron $u_j$ if $u_j$ is the winning neuron 
when we present vector $v_i$ to the neural net: $u_j = \Psi(v_i)$. It is frequently desired that 
similar input vectors are represented by the same neuron or by neighbouring neurons. 
This will not always be the case. If, for example, the training set $D$ consists of many 
two-dimensional uniformly distributed vectors, and one is using a one-dimensional 
neural net (with two-dimensional weight vectors) with nine neurons, then the weight 
vectors will be uniformly distributed over the input space. The sequence of neurons 
$u_i$ associated with the weight vectors $w_i$ forms a chain through the input space. The 
input space will be divided by equally sized receptive fields $R(w_i)$ (see, for example, 
Figure 4.58). 

Similar input vectors on both sides of the border of, for example, $R(w_2)$ and $R(w_3)$ 
are represented by the neighbouring neurons $u_2$ and $u_3$, but similar input vectors on 
both sides of the border of some other pairs of adjacent receptive fields are represented 
by neurons that are not neighbours at all in the chain. If we had used instead a 
two-dimensional neural lattice, then similar 
input vectors would always have been represented by the same neuron or by 
neighbouring neurons with distance 1 (see Figure 4.59). 

In Figure 4.59 the mutual relative position of two vectors $v_i$ and $v_j$ in the input 
space $V$ is to a certain extent preserved by the mutual relative position of neurons 
$\Psi(v_i)$ and $\Psi(v_j)$. The metric of the input space is preserved. 

A mapping from a metric space $V$ with distance measure $d_V$ to a metric space $L$ 
with distance measure $d_L$ is metric preserving if the triangular inequality property is 
preserved, i.e. if 

$$d_V(v_i, v_j) < d_V(v_i, v_k) + d_V(v_k, v_j)$$

then 

$$d_L[\Psi(v_i), \Psi(v_j)] \le d_L[\Psi(v_i), \Psi(v_k)] + d_L[\Psi(v_k), \Psi(v_j)]$$

For metric preservation it is required that if the distance between two vectors from 
$V$ is small, they will be represented by the same neuron or neighbouring neurons in 
the neural net. This requirement can be stated as follows. If two receptive fields $R(w_i)$ 
and $R(w_j)$ in $V$ are adjacent (the common border $\partial R_{ij} = R(w_i) \cap R(w_j)$ is not empty), 
then neuron $u_i$ with weight vector $w_i$ is a neighbour of neuron $u_j$ with weight vector $w_j$. 

Figure 4.58 Two-dimensional data represented by two-dimensional weight 
vectors in a one-dimensional neural network 

Figure 4.59 Metric preservation. Left figure: input space. Right figure: neural 
lattice with the projections of $v_i$ and $v_j$ 

If this property holds for all receptive fields, we say that the feature mapping is 
topology preserving. One may note that topology preservation is the complement of 
the well-ordering property. In Figure 4.58 there is no topology preservation while in 
Figure 4.59 there is topology preservation. 
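
Checking adjacency of receptive fields exactly requires their geometry, so the sketch below uses a sample-based proxy instead, which is our substitution and not the definition given above: for each sample, the winner and the runner-up are treated as having adjacent receptive fields, and we count how often those two neurons are not neighbours in the lattice. In the self-organizing map literature this fraction is commonly called the topographic error. The lattice is assumed to be a one-dimensional chain in which neurons $i$ and $i+1$ are neighbours; the function name and the toy weights are hypothetical.

```python
import math

def non_neighbour_fraction(samples, weights):
    """Fraction of samples whose two closest neurons are not chain neighbours.

    weights[i] is the weight vector of neuron i in a one-dimensional chain.
    """
    bad = 0
    for v in samples:
        order = sorted(range(len(weights)),
                       key=lambda i: math.dist(v, weights[i]))
        first, second = order[0], order[1]
        if abs(first - second) > 1:
            bad += 1
    return bad / len(samples)

# A chain of four neurons folded into a U shape in a two-dimensional input space.
chain = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
print(non_neighbour_fraction([[0.0, 0.45], [0.4, 0.1]], chain))
```

In the toy call the first sample lies between the two ends of the chain, which are close in the input space but far apart in the lattice, so it is counted as a violation; the second sample is not.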
Figure 4.60 Two adjacent receptive fields 


Because the self-organizing algorithm is minimizing the error function: 

$$E(W) = \frac{1}{2} \sum_{w_r} \sum_{w_s} \sum_{v \in R(w_s)} \varepsilon(t)\,p(v)\,h(r, s, t)\,|v - w_r|^2$$

the self-organizing algorithm will optimize the topology preservation. However, 
complete topology preservation is frequently not possible because the dimension of 
the input space V is not always the same as the dimension of the neural lattice L. 

Although the feature mapping $\Psi$ will try to preserve the topology of the input 
space in the neural lattice, it may happen that input vectors that are almost identical 
are represented by different neurons, whereas input vectors that are relatively more 
different are represented by the same neurons. In Figure 4.60 we have given two 
adjacent receptive fields $R(w_p)$ and $R(w_q)$ of two neighbouring neurons $u_p$ and $u_q$ with 
coordinate vectors $i_p$ and $i_q$ with distance $d_L(i_p, i_q) = 1$. The input vector $v_1$ is almost 
identical to input vector $v_2$, whereas $v_3$ is more different from input vector $v_1$. We 
have, however: 

$$d_L[\Psi(v_1), \Psi(v_2)] = d_L(i_p, i_q) = 1$$

and 

$$d_L[\Psi(v_1), \Psi(v_3)] = d_L(i_p, i_p) = 0$$

This phenomenon is due to the discontinuity of the feature mapping $\Psi$ from $V$ to 
$L$. The self-organizing algorithm will, however, try to minimize the discontinuity of 
the feature mapping. 

If the dimension of the input space $V$ is the same as the dimension of the neural 
lattice $L$ and the learning set $D_V$ contains many examples homogeneously distributed 
over some bounded area, then topology will be preserved for that area. This implies 
that if the distance between two input vectors approaches 0, then the distance between 
the corresponding winning neurons is at most equal to 1. If, however, the dimension 
$m$ of the input space is larger than the dimension $n$ of the neural lattice, then in 
general there will be no topology preservation, and if the distance of two input vectors 
approaches 0, then the distance between the corresponding winning neurons might 
become large (>> 1) (see also Figure 4.58). A continuous path in $V$ results in a 
corresponding track in the lattice $L$ with in general large jumps between the successive 
winning neurons. 


It might, however, occur that we obtain topology preservation in a restricted 
sense if the training set $D$ consists of $m$-dimensional vectors, whereas the neural lattice 
has a dimension $n$ much smaller than $m$. This will be the case if the components of 
the $m$-dimensional vectors of the training set $D$ are interrelated such that they can 
be represented by points in an $n$-dimensional space while preserving the topology 
for the elements of $D$. A trivial example is the set of three-dimensional vectors 
{[1, 1, 2], [2, 1, 4], [3, 1, 6], [4, 1, 8]} that can be placed in a one-dimensional row by 
a mapping $\Psi$ (with $\Psi(v) = v_1$, the first component of $v$) while preserving the topology 
restricted to the set $D$, i.e. if 

$$d_V(v_i, v_j) < d_V(v_i, v_k) + d_V(v_k, v_j)$$

then 

$$d_L[\Psi(v_i), \Psi(v_j)] \le d_L[\Psi(v_i), \Psi(v_k)] + d_L[\Psi(v_k), \Psi(v_j)]$$
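
For the trivial example a stronger property than the implication above can be verified: every distance in $V$ is the same constant multiple of the corresponding distance in the row. The sketch assumes the Euclidean distance in $V$, the absolute difference in the row, and the mapping onto the first component; the names `D` and `psi` are ours.

```python
import math
from itertools import combinations

D = [[1, 1, 2], [2, 1, 4], [3, 1, 6], [4, 1, 8]]

def psi(v):
    """Map a three-dimensional vector onto its first component."""
    return v[0]

# Ratio of the input-space distance to the distance in the row, for all pairs.
ratios = [math.dist(a, b) / abs(psi(a) - psi(b)) for a, b in combinations(D, 2)]
print(ratios)
```

All ratios come out as the square root of 5, because consecutive vectors differ by the fixed step [1, 0, 2]; the relative positions of the elements of $D$ are therefore reproduced exactly in the one-dimensional row.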


A less trivial example of this phenomenon was given in Section 4.2 where we 
showed that the topology of the two-dimensional pattern of the capital ‘G’ was 
preserved in a two-dimensional neural net. In that case we used a training set D of 
1200 25-dimensional input vectors obtained by sampling the pattern with a special 
window. The observation window is constructed in such a way that the observation 
vectors can be mapped on points in a two-dimensional metric space with topology 
preservation between the data set D and that two-dimensional space. The dimension 
of the neural net used was two and the input space was restricted to D. We summarize 
our discussion in the next property: 


Practical statement 4.1 

Restricted topology preservation 

If the m-dimensional vectors of the training set D can be mapped on a finite number 
of points of an n-dimensional metric space with preservation of topology, then we can 
use an n-dimensional neural net and preserve topology restricted to the elements of D 
(or restricted to vectors almost equal to elements of D). 


4.12 Interpolation with the self-organizing algorithm 


There is yet another property of the self-organizing neural net algorithm that might 
be profitable for some applications. If the neural lattice contains more neurons than 
there are input vectors in the training set, then, due to the vector quantization 
property, there will be at the final phase of learning a set of weight vectors that will 
be copies of all the input vectors of the training set. During training of the neural 
net the redundant weights will also be adapted to the input vectors corresponding 
to the weight vectors of the surrounding winning neurons. In this way the redundant 
weight vectors will obtain values that one would find by interpolation 
between the values of the training set. 

Figure 4.61 Input/weight space. The ‘b’s represent nineteen input vectors. 
The squares represent the weight vectors in a two-dimensional 
net with thirty-two neurons 


Example 4.10 


In Figure 4.61 we have given the result of training a two-dimensional neural net with 
thirty neurons and a training set with nineteen two-dimensional input vectors 
represented in Figure 4.61 by the letter ‘b’. The final weights of the neurons are given 
by small squares. 


We can summarize our discussion as follows: 


Practical statement 4.2 

Interpolation property 


If the neural net contains more neurons than there are elements in the training set 
D, then the redundant weight vectors will be interpolated between the weight vectors 
that are copies (or almost copies) of the input vectors of the training set. 


A frequently undesired interpolation of weight vectors between input vectors of 
the data set can, however, also occur if the number of elements in the data set is 
larger than the number of neurons but the elements of the data set $D$ are separated 
by a relatively large empty area. We will explain this phenomenon shortly. In Section 
4.5 we found that the algorithm tries to minimize the value of $|w_i - w_j|^2$ for all pairs 
of weight vectors $(w_i, w_j)$. By introducing a weight vector $w_k$ between $w_i$ and $w_j$, if 
$|w_i - w_j|^2$ is large, a much smaller value of the replacement $|w_i - w_k|^2 + |w_k - w_j|^2$ can 
be obtained. The weight vector $w_k$ is not, however, representing the input vectors. 
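
A small numerical illustration of this argument, with hypothetical weight values: placing $w_k$ at the midpoint between $w_i$ and $w_j$ halves the sum of squared distances.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

w_i, w_j = [0.0, 0.0], [4.0, 0.0]
w_k = [(x + y) / 2 for x, y in zip(w_i, w_j)]   # midpoint between w_i and w_j

direct = sq_dist(w_i, w_j)                       # one long link
via_k = sq_dist(w_i, w_k) + sq_dist(w_k, w_j)    # two short links
print(direct, via_k)
```

The direct link contributes 16, the two links through the midpoint contribute 4 + 4 = 8, so the error term favours the intermediate weight vector even though no input vector lies near it.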


Example 4.11 


In Figure 4.62 we have given the result of training a two-dimensional network of 
5 x 5 neurons with 1000 input vectors uniformly distributed on a circle. Small circles 
represent the final weight vectors. Weight vectors are connected with a straight line 
if the corresponding neurons are neighbours in the neural lattice. 

Figure 4.62 Representation of 1000 input vectors uniformly distributed on 
a circle by weight vectors in a 5 x 5 neural network 


4.13 Master-slave and multi-net decomposition of the 
self-organizing neural net algorithm 


In this section we will discuss the decomposition of input vectors and the application 
of the self-organizing algorithm to the different parts of the decomposed input vector. 
First we will discuss the master-slave decomposition as already applied in 
Section 4.9 on the Bayes classifier. 

In several applications of the self-organizing algorithm the set of data vectors 
consists of pairs of vectors. For instance, in the case of function identification from 
samples, the training set contains pairs of argument vectors and function-value vectors. 
In the case of function identification we want generalization (or interpolation) from 
samples. To obtain a proper result the different parts of the vectors of the training 
set must be treated differently. We want a metric that preserves quantization of the 
argument values and a representation and interpolation of the function values. 

One part of the data vectors will be treated by the self-organizing algorithm in 
the same way as was done in previous sections. We call that part of the input vector 
the master input vector $v_i$. The second part of the input vector, called the slave input 
vector, denoted by $\tilde{v}_i$, will not be used to find the winner in the neural net and will 
only be used to adapt the weight vectors in the neural net. The total input vector 
will be denoted by $\hat{v}_i = [v_i, \tilde{v}_i]$. 

In the same way we make a decomposition of the weight vectors of the neural 
lattice. One part is called the master weight vector $w_r$ and the other part is called the 
slave weight vector $\tilde{w}_r$. The total weight vector will be denoted by $\hat{w}_r = [w_r, \tilde{w}_r]$. The 
algorithmic adaptation rule is then as follows. 

Given some training vector $\hat{v}(t) = [v_i, \tilde{v}_i]$: 


1. Determine the winning neuron $u_s$ for the master vector $v(t)$, i.e. 

$$d_V[w_s(t), v(t)] = \min_r d_V[w_r(t), v(t)]$$

2. Every weight vector $\hat{w}_r(t) = [w_r, \tilde{w}_r]$ in the net will be changed to: 

$$\hat{w}_r(t+1) = \hat{w}_r(t) + g(r, s, t)\,[\hat{v}(t) - \hat{w}_r(t)]$$

with $g(r, s, t)$ a scalar-valued adaptation function as discussed in Section 4.3. 


From the algorithmic adaptation rule we conclude that the master input vectors 
and the master weight vectors are manipulated as in the previous section. The 
preceding theory will thus hold in the same way for the master vectors. 

Due to the vector quantization property of the algorithm, the final complete weight 
vectors $\hat{w}_r$ will become similar to the elements $\hat{v}_i$ and thus the slave weight vectors 
will also become similar to the slave input vectors. 

If the number of neurons in the neural lattice is larger than the number of input 
vectors, then the values of weight vectors of the redundant neurons will be interpolated 
between the weight vectors that are similar to the input vectors. 

In the next two sections we apply the method of master-slave decomposition to 
function identification and control of a robot arm. 

We found in Section 4.11 that it is preferable to have the dimension of the neural 
net equal to the dimension of the input space. Frequently the dimension of the input 
space is large and would thus require the neural lattice to be similarly large. But the 
required number of neurons will grow exponentially with the dimension of the neural 
lattice. If the number of neurons per dimension is equal to $d$ and the dimension of 
the neural net is equal to $m$, then $d^m$ neurons are required. This number may become 
larger than the number of elements in the training set and we cannot use the neural 
net for proper vector quantization. 

Moreover, if the dimension of the input space is larger than the dimension of the 
neural net, the property of topology preservation is lost; the greater the difference in 
dimension, the greater the number of defects. 

Therefore we frequently want the dimension $m$ of the neural net to be low and 
equal to (or close to) the dimension of the input vectors. A solution to this problem is 
to use the multi-net decomposition method. 

The multi-net decomposition method is straightforward: we divide the input vector 
in some way into k parts and we use k different neural nets. The pth neural net is 
trained with the pth part of the vectors of the training set. 

If we divide the original $m$-dimensional input vector into $k$ equal parts of dimension 
$m/k$ and use $k$ neural networks of dimension $m/k$, and the number of neurons in each 
dimension of all subnetworks is $d$, then the number of neurons is reduced from $d^m$ 
to $k\,d^{m/k}$. 
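
The size of the reduction is easy to underestimate, so a short calculation may help. The function names and the values of $d$, $m$ and $k$ below are illustrative choices of ours, not figures from the text.

```python
def neurons_single_net(d, m):
    """Neurons in one m-dimensional lattice with d neurons per dimension."""
    return d ** m

def neurons_multi_net(d, m, k):
    """Total neurons in k lattices of dimension m/k with d neurons per dimension."""
    if m % k != 0:
        raise ValueError("m must be divisible by k")
    return k * d ** (m // k)

# Illustrative values: ten neurons per dimension, six-dimensional input, three parts.
print(neurons_single_net(10, 6), neurons_multi_net(10, 6, 3))
```

With these values a single six-dimensional lattice would need a million neurons, whereas three two-dimensional lattices need three hundred in total.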

In Section 4.16 we will apply the multi-net decomposition method to EEG analysis. 


4.14 Application of the self-organizing algorithm to 
function identification 


Assume we have several pairs of argument values and function values of some unknown 
function, and we want to know the functional relationship between arguments 
and function values. If we do not require a mathematical description of the functional 
relationship but are satisfied with a (hardware or software) realization of the function 
in a restricted domain, then we can use a neural network to approximate the unknown 
function. In Chapter 3 we have shown how we can use a continuous multi-layer 
Perceptron to identify an unknown function. We can, however, also use a 
self-organizing neural net algorithm for that purpose. The main difference will be 
that we obtain a quantized version of the unknown function. 
At first glance one is inclined to make training vectors composed of pairs of 
argument and function values. If we use these training vectors we know that the final 
weight vectors will be copies of these training vectors, and if the number of neurons 
is larger than the number of training vectors we also obtain weight vectors by 
interpolation between the training vectors. The weight vectors are also composed of 
pairs of arguments and function values; the first part corresponds with an argument 
and the second part with a function value. After training, we present some argument 
of the function and determine the weight vector with an argument part with the 
minimal distance to the presented argument. Then we read in the pertinent weight 
vector the second part as the desired function value. However, this method will give 
incorrect results if there are weight vectors interpolated between the weight vectors 
that are copies of the training vectors. Interpolated weight vectors will be located in 
the area between the curves representing the functional relationship. 


Example 4.12 


In an experiment we applied 1000 training samples [x_i, y_i] of the function y = 10x^2,
with x_i in the interval [-0.3, +0.3], to the self-organizing algorithm. The neural net
contained fifty neurons. The result is given in Figure 4.63 (the line represents the 1000
training samples and the dots the [x, y] values of the weight vectors). We observe
that even in this case, where the number of elements in the training set is much larger
than the number of neurons, we obtain interpolated weight vectors that are wrongly
located.

In a second experiment we used 1000 samples [x_i, y_i] of the function y = 10x^2 from
the domain [-3, +3]. The result, given in Figure 4.64, is even worse. If we use more
samples (i.e. 10000) and extend the learning phase to 100000 steps, then we obtain
the result given in Figure 4.65.

Figure 4.63 Representation of 1000 pairs of argument and function values
of the function y = 10x^2 by two-dimensional weight vectors
(dots) in a one-dimensional neural net of fifty neurons.
Input argument values in [-0.3, +0.3]

Figure 4.64 The representation of 1000 pairs of argument and function
values of the function y = 10x^2 by two-dimensional weight
vectors (dots) in a one-dimensional neural net of fifty
neurons. Input argument values in [-3, +3]

Figure 4.65 The representation of 10000 pairs of argument and function
values of the function y = 10x^2 by two-dimensional weight
vectors (dots) in a one-dimensional neural net of fifty
neurons. Input argument values in [-0.3, +0.3]

Figure 4.66 The master-slave method. The representation of 1000 pairs of
argument and function values of the function y = 10x^2 by
two-dimensional master-slave weight vectors (dots) in a
one-dimensional neural net of fifty neurons. Input argument
values in [-0.3, +0.3]


Proper function identification with the self-organizing neural net algorithm can,
however, be obtained by using the master-slave method presented in Section 4.13.
If the argument x of the unknown function is m-dimensional, then we use
an m-dimensional neural net and take x as the master vector. The corresponding
n-dimensional function value vector y is used as the slave vector. If there are enough
samples, then the x values of the weight vectors will be ordered in a regular
m-dimensional lattice. The slave elements will be copies of the function values and
if there are more neurons than samples, additional y values will be interpolated
between the y values that are copies of the slave-training samples.
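
The master-slave idea for function identification can be illustrated with a small sketch. It is not the implementation of Section 4.13: it assumes that the winning neuron is selected on the master part (x) only and that master and slave parts are then adapted with the same neighbourhood rule. The Gaussian neighbourhood, the decay schedules for the learning rate and neighbourhood width, and all function names are illustrative assumptions.

```python
import numpy as np

def train_master_slave(xs, ys, n_neurons=50, steps=20000, seed=0):
    """Train a one-dimensional lattice; winners are chosen on the master part only."""
    rng = np.random.default_rng(seed)
    master = rng.uniform(xs.min(), xs.max(), n_neurons)  # argument part of the weights
    slave = np.zeros(n_neurons)                          # function-value part
    lattice = np.arange(n_neurons)
    for t in range(steps):
        frac = t / steps
        eps = 0.5 * (0.01 / 0.5) ** frac      # learning rate 0.5 -> 0.01 (assumed schedule)
        sigma = 10.0 * (0.5 / 10.0) ** frac   # neighbourhood width 10 -> 0.5 (assumed)
        k = rng.integers(len(xs))
        winner = np.argmin(np.abs(master - xs[k]))       # only the master decides
        h = np.exp(-((lattice - winner) ** 2) / (2.0 * sigma ** 2))
        master += eps * h * (xs[k] - master)
        slave += eps * h * (ys[k] - slave)               # slave follows the same update
    return master, slave

def identify(master, slave, x):
    """Read the slave value of the neuron whose master part is closest to x."""
    return slave[np.argmin(np.abs(master - x))]

# 1000 samples of y = 10 x^2 on [-0.3, +0.3], as in the first experiment.
rng = np.random.default_rng(1)
xs = rng.uniform(-0.3, 0.3, 1000)
ys = 10.0 * xs ** 2
master, slave = train_master_slave(xs, ys)
grid = np.linspace(-0.25, 0.25, 21)
mean_abs_error = np.mean([abs(identify(master, slave, x) - 10.0 * x ** 2) for x in grid])
```

Because the slave part never influences the choice of the winner, the ordering of the lattice is determined by the arguments alone, which is why no weight vectors end up between the branches of the curve.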


Figure 4.67 The master-slave method. The representation of 1000 pairs of
argument and function values of the function y = 10x^2 by
two-dimensional master-slave weight vectors (dots) in a
one-dimensional neural net of fifty neurons. Input argument
values in [-3, +3]


Example 4.13 


If we repeat the experiments mentioned in the previous example with the master-slave
method, then we obtain under the same conditions the results given in
Figures 4.66 and 4.67.


4.15 Application of the self-organizing algorithm to robot arm 
control 


Suppose we use a monitor connected to a camera to observe an object on a square
table. The coordinates of the table top will be denoted by u and v. We want a robot
to learn to grasp the object from the table given the x and y position of the object
on the monitor screen (not u and v). For simplicity our robot consists of an arm with
two parts moving in a horizontal plane (see Figure 4.68). With two servomotors we
can control the two angles H_1 and H_2 in order to reach every point on the table.

In a training phase the object is placed somewhere on the table and we form a
vector v with the observed values of x, y, H_1 and H_2 as the four components. Note
that the picture on the screen does not give the u and v position of the object on the
table top but the x and y position of the object in a perspective view on the monitor.
In the training phase we place the object in 10000 random positions on the table.
In this way we obtain 10000 training vectors. We use the master-slave decomposition
method with [x, y] as the master vector and [H_1, H_2] as the slave vector. We use a
two-dimensional neural network with 20 x 20 neurons. After training, the object is
placed on the table and we observe its [x, y] position on the monitor. We apply this
vector to the neural net and determine the winning neuron for this master vector. The
corresponding slave vector gives us the correct value of [H_1, H_2] to grasp the object.

Figure 4.69 Initial representation of table coordinates by the two-dimensional
master-slave weight vectors (dots in the right-hand figure) in a
two-dimensional net of 20 x 20 neurons. Weight vectors are
connected with a line if they belong to neighbouring neurons

Figure 4.70 Representation of table coordinates by the two-dimensional
master-slave weight vectors (dots in the right-hand figure) in a
two-dimensional net of 20 x 20 neurons after 446 training
examples. Weight vectors are connected with a line if they belong
to neighbouring neurons

Figure 4.71 Representation of table coordinates by the two-dimensional
master-slave weight vectors (dots in the right-hand figure) in a
two-dimensional net of 20 x 20 neurons after 10002 training
examples. Weight vectors are connected with a line if they belong
to neighbouring neurons

After learning, we have 400 weight vectors. For each slave weight vector [H_1, H_2]
there is a corresponding value of the table top coordinates [u, v]. These 400 values
of [u, v] are given by the corner points of the lattice depicted in Figures 4.69-4.71
after zero training examples, 446 training examples and 10002 examples, respectively.
We see that after 10002 training steps 400 [x, y] positions on the monitor are
translated into 400 (almost) correct positions for the robot arm.


4.16 Application of the self-organizing algorithm to EEG signal 
analysis 


In recording electroencephalograms (EEGs) of epileptic patients one frequently
observes at irregular intervals certain short wave forms called spike-wave complexes
(SWCs) in which a spike is followed by a slow wave (see Figure 4.72). The duration
of such an SWC is about 0.5 sec, and SWCs can occur in sequences without interruption.
EEGs are sometimes recorded over a 24-hour period, and it takes considerable time 
to screen such a recording for the occurrence of SWCs. Automatic detection and 
quantification of SWCs would be very useful. 

Figure 4.72 Two spike-wave complexes in an EEG signal

We can use a self-organizing neural net algorithm, after a training period, for the
detection of the SWCs. The EEG signal is first sampled with a frequency of 366 Hz.
We can observe a complete SWC with a window of about 180 samples. When we
move the window along the EEG signal we obtain a large set of observation vectors
composed of the values of the 180 consecutive samples of the window. If we move
the window with a step size of one sample we obtain in this way 60 x 366 different
observation vectors for a recording of 1 min.
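
As an illustration of how such observation vectors can be formed, the following sketch slides a 180-sample window over a signal with a step of one sample. The random array merely stands in for a real 1 min EEG recording, and the variable names are illustrative. Strictly, a signal of N samples yields N - 180 + 1 complete windows, which is close to the 60 x 366 mentioned above.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

fs = 366        # sampling frequency in Hz, as stated in the text
window = 180    # window length in samples, as stated in the text

# Placeholder signal standing in for 1 min of EEG (an assumption for illustration).
eeg = np.random.default_rng(0).standard_normal(60 * fs)

# Each row is one observation vector; consecutive rows are shifted by one sample.
observations = sliding_window_view(eeg, window)
```

The view shares memory with the original signal, so even the full set of overlapping windows does not require copying the data.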

Some observation vectors correspond to SWCs and others to observations of the 
ordinary EEG signals. We can use the self-organizing neural algorithm to make a 
vector quantization of the total set of observation vectors. In this way we can obtain 
weight vectors that are copies of clusters of observation vectors. Observation vectors 
obtained with the window located on SWCs will result in similar weight vectors (and 
interpolated weight vectors) in the neural lattice. After training, we label the weight 
vectors that are similar to observation vectors obtained from SWCs as elements of 
the SWC class. If after training we present an observation vector of a new recording 
to the neural net, we can determine whether or not it is similar to a weight vector 
of the class labelled as SWC vectors. 

If we develop a classification system as described above we need, for a proper 
vector quantization and interpolation, a neural net with the same, or almost the 
same, dimension as the dimension of the input vectors. This would require a 
180-dimensional neural lattice. The number of neurons will become very large (d^180,
if d is the number of neurons per dimension) and the detection speed will be very 
low. We want to have the dimension of the neural lattice much lower so we have to 
reduce the dimension of the observation vector. We can achieve this goal by three 
methods: (i) reducing the length of the observation interval; (ii) preprocessing the 
observation vectors; (iii) using the multi-net decomposition method as described in 
Section 4.13. 


1. We reduce the window length to 100 samples. 

2. To detect the typical form of a spike we must observe the EEG very accurately. 
For the detection of the slow wave we can take the mean of the sample values in 
successive intervals. For our observation vector we take the first thirty samples 
for the first thirty components. The remaining interval of the window is divided 
into seven subintervals. The mean of the ten samples in the seven subintervals 
gives us the next seven components of the observation vector (see Figure 4.73). In 
this way the dimension of the observation vector becomes thirty-seven. 

3. We divide the observation vector v_i into two vectors: v_si and v_wi. The vector v_si
contains the first thirty components of v_i, and the vector v_wi contains the remaining
seven components. We take two neural networks: one network, called the spike
network, is trained with the vectors v_si; the second network, the wave network, is
trained with the vectors v_wi.
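
The three reduction steps above can be sketched as follows. The function name is hypothetical; the numbers (a 100-sample window, thirty raw samples for the spike part, seven means over ten samples for the wave part) follow the description in steps 1 to 3.

```python
import numpy as np

def split_observation(window_samples):
    """Return (v_spike, v_wave) for one 100-sample window."""
    w = np.asarray(window_samples, dtype=float)
    if w.shape != (100,):
        raise ValueError("expected a window of exactly 100 samples")
    v_spike = w[:30].copy()                        # first thirty samples, kept as they are
    v_wave = w[30:].reshape(7, 10).mean(axis=1)    # seven means over ten samples each
    return v_spike, v_wave

# Small check with a ramp signal: the means are easy to verify by hand.
v_s, v_w = split_observation(np.arange(100))
```

For the ramp 0, 1, ..., 99 the first wave component is the mean of samples 30 to 39 and the last one the mean of samples 90 to 99.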


For the training we used an EEG recording of about 1 min duration. The recording
contained forty-two SWCs. A neurologist indicated the beginning of nineteen out of
these forty-two SWCs. The observation vectors were obtained by placing the window
at 5000 random positions in the 1 min recording, with the restriction that about 35
per cent of the observations were obtained by locating the beginning of the window
at the beginning of one of the nineteen marked SWCs.

Figure 4.73 Outline for obtaining thirty-dimensional vectors for training the
spike-net and ten-dimensional vectors for training the wave-net

We trained a two-dimensional 6 x 6 neural net (the spike network) with the vectors
v_si, and a two-dimensional 6 x 6 net (the wave network) with the vectors v_wi. After
learning, the weight vectors will represent (vector quantization) the set of training
vectors. In Figure 4.74 we have given the reconstructed signal form of the weight
vectors of the 6 x 6 spike neural net. Several of the weight vectors of the spike neural
net represent the observation of a spike of an SWC and others represent the observation
of an arbitrary part of the EEG signal that is not an SWC. The same holds for the waves
of the SWC represented by the weight vectors of the wave network.

In order to know which weight vectors in the spike net represent spikes, we have 
to label the weight vectors. After learning, we take the nineteen observation vectors 
obtained by observing the marked SWCs in the recording. We determine for each
of these observation vectors the weight vector with the smallest distance to the 
observation vector and label that weight vector as a spike weight vector. The same 
is done for the wave neural net. 

Figure 4.74 Signal segments corresponding to the weight vectors in the
trained two-dimensional spike-net with 6 x 6 neurons

After the training and labelling phase, we can use the neural net to detect SWCs
in an EEG recording. We move with the window along the EEG signal and at each
step we present the observation vector v_si to the spike net, and the vector v_wi to the
wave net. If in the spike net and in the wave net the labelled weight vectors are
simultaneously the winning weight vectors, then the observed signal segment is
classified as an SWC.
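
The decision rule just described can be sketched in a few lines. The function names are hypothetical, the weight matrices are assumed to hold one weight vector per row, and the labelled neurons are assumed to be stored as sets of row indices. The tiny weight matrices are toy values, not results from the experiment.

```python
import numpy as np

def winner(weights, v):
    """Index of the weight vector (row) with the smallest Euclidean distance to v."""
    return int(np.argmin(np.linalg.norm(weights - v, axis=1)))

def is_swc(v_spike, v_wave, spike_weights, wave_weights, spike_labelled, wave_labelled):
    """Classify as SWC only if both winning neurons carry a label."""
    return (winner(spike_weights, v_spike) in spike_labelled
            and winner(wave_weights, v_wave) in wave_labelled)

# Toy nets with two neurons each; neuron 1 is the labelled one in both nets.
spike_weights = np.array([[0.0, 0.0], [1.0, 1.0]])
wave_weights = np.array([[0.0], [5.0]])
hit = is_swc(np.array([0.9, 1.1]), np.array([4.0]),
             spike_weights, wave_weights, {1}, {1})
miss = is_swc(np.array([0.1, 0.0]), np.array([4.0]),
              spike_weights, wave_weights, {1}, {1})
```

Requiring both nets to agree makes the detector stricter than either net alone: a spike-like transient without a following slow wave is rejected.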

In the experimental setup we found that forty of the forty-two SWCs were detected 
and eight observations (signal forms resembling an SWC) were wrongly classified as 
an SWC. 


4.17 Application of the self-organizing algorithm to speech 
recognition 


The ultimate goal of a speech recognition system is the automatic conversion of 
recorded speech sound into corresponding written text. The most promising approach 
to large vocabulary automatic speech recognition is to build a recognizer for the 
smallest linguistic units that can occur in words: the phonemes. Subsequently one 
has to transform the resulting string of phonemes into words and at a still higher 
level into semantic knowledge. The number of phonemes is small (about 60) compared
to the number of words in some vocabulary. In Table 4.1 we have given the phonemes 
used in the so-called TIMIT database, containing the sound recordings of ten sentences 
spoken by 630 speakers. The acoustic signal in the database is labelled with the 
corresponding sequences of phonemes. The speech signal was sampled at 16 kHz 
with a 16-bit analog/digital converter. 

Table 4.1 List of phonemes

Phone   Example       Phone   Example           Phone   Example
/iy/    beat          /er/    bird              /z/     zoo
/ih/    bit           /axr/   diner             /zh/    measure
/eh/    bet           /el/    bottle            /v/     very
/ae/    bat           /em/    yesem             /f/     fief
/ux/    beauty        /en/    button            /th/    thief
/ix/    roses         /eng/   Washington        /s/     sis
/ax/    the           /m/     mom               /sh/    shoe
/ah/    butt          /n/     non               /hh/    hay
/uw/    boot          /ng/    sing              /hv/    Leheigh
/uh/    book          /ch/    church            /pcl/   (p closure)
/ao/    bought        /jh/    judge             /tcl/   (t closure)
/aa/    cot           /dh/    they              /kcl/   (k closure)
/ey/    bait          /b/     bob               /qcl/   (q closure)
/ay/    bite          /d/     dad               /bcl/   (b closure)
/oy/    boy           /dx/    (butter)          /dcl/   (d closure)
/aw/    about         /nx/    (flapped n)       /gcl/   (g closure)
/ow/    boat          /g/     gag               /epi/   (epi closure)
/l/     led           /p/     pop               /h#/    (begin sil)
/r/     red           /t/     tot               /#h/    (end sil)
/y/     yet           /k/     kick              /pau/   (between sil)
/w/     wet           /q/     (glottal stop)

When we observe the acoustic signal corresponding with some phoneme (duration
0.1 sec), then the time-dependent frequency spectrum is to a certain extent
characteristic for that particular phoneme. This phenomenon can be captured by
observing the speech sound through a window shorter (8 msec = 128 samples) than
the smallest phoneme, shifting the window by discrete steps along the acoustic signal,
and determining at each step the frequency spectrum of the signal in the window.
The coefficients of the frequency spectrum (or some transformation thereof)
corresponding with a window observation on some phoneme signal constitute a
spectral vector.

We will now describe in a simplified experiment how we can use a self-organizing 
neural algorithm to learn to recognize a spoken sentence. The experiment will not 
reveal the quality of the recognition performance but will only reveal the idea of how 
a self-organizing neural net can be used for speech recognition.


Learning with the multi-net method


We can use the self-organizing neural net algorithm to learn the time-dependent
frequency characteristic of a phoneme. We use the multi-net method described in
Section 4.13. For each phoneme we use a two-dimensional neural net of 5 x 5 neurons.
From the acoustic signal of a spoken sentence of some speaker (labelled with
phonemes) we make a set of all spectral vectors obtained by placing the window
somewhere on the acoustic signal. For each phoneme we collect all spectral vectors
obtained by observing with our window the acoustic signal of that particular phoneme
occurring in different positions in the sentence. Such a set will be called a phoneme
spectral set. For each phoneme we train a separate neural net, called a phoneme net,
by randomly selecting 10000 times a vector from the pertinent phoneme spectral
vector set. After training, due to the vector quantization property, each neural net
will represent by its twenty-five weight vectors the most common spectral vectors of
the particular phoneme.


Labelling the sequence of winning neurons 


After training, we pass the observation window again step by step along the sentence, 
and for each phoneme interval we register the sequence of winning neurons (i.e. which 
neuron in which neural net) for the applied sequence of spectral vectors. Each time 
the window passes the same phoneme the sequence of winning neurons will be similar 
and almost all winning neurons will be located in the phoneme net that corresponds 
to the phoneme we are scanning.

For those acquainted with hidden Markov models, it should be noted that we 
could construct a phoneme characteristic hidden Markov model from those sequences 
of winning neurons. However, we can also use a cruder method to characterize a 
phoneme. When passing a phoneme with our window it turns out that the number 
of times the winning neuron will be located in the corresponding phoneme net exceeds 
a certain minimum. (There may be incidental interruptions in which the winning neuron
is located in a phoneme net different from that of the phoneme we are scanning.) We
can use this minimum score as a criterion for the detection of a certain phoneme. 


Results of a simple experiment 


We used the following single sentence spoken by one speaker to learn and to test
the recognition by the self-organizing neural net: ‘She had your dark suit in greasy
wash water all year.’ The acoustic signal was labelled by phonemes.

We used a simplified version of the set of phonemes, as should become clear from
the sequence of phonemes attached to the above spoken sentence, represented in
Tables 4.2 and 4.3. In parentheses we give the sample numbers of the beginning and
ending of each phoneme. (The time between two successive samples is about
0.06 msec.) Because we move the window in the test phase with steps of ten samples,
we have also given the duration of a phoneme in intervals of ten samples. The final
number in the row (after the arrow) gives the number of times the winning neuron
was in the phoneme net that corresponds to the phoneme we are observing with
steps of ten samples.
Table 4.2 Test result

sh   as in She      (00000-01666)   duration 166.6 x 10 samples   → 146
iy   as in She      (01666-02626)   duration  96.0 x 10 samples   → 103
hv   as in had      (02626-03446)   duration  82.0 x 10 samples   →  78
ae   as in had      (03446-05285)   duration 183.9 x 10 samples   → 179
*    the phoneme (almost silence) at the beginning of the d in ‘had’
                    (05285-05826)   duration  54.1 x 10 samples   →  50
d    as in had      (05826-06213)   duration  38.7 x 10 samples   →  28
y    as in your     (06213-06642)   duration  42.9 x 10 samples   →  29
er   as in your     (06642-07986)   duration 134.4 x 10 samples   → 128
*    the phoneme (almost silence) at the beginning of the d in ‘dark’
                    (07986-09065)   duration 107.9 x 10 samples   → 109
dd   as in dark     (09065-09266)   duration  20.1 x 10 samples   →   0
aa   as in dark     (09266-12159)   duration 289.3 x 10 samples   → 274
*    the phoneme (almost silence) at the beginning of the k in ‘dark’
                    (12159-12866)   duration  70.7 x 10 samples   →  64
kk   as in dark     (12866-13146)   duration  28.0 x 10 samples   →  29
ss   as in suit     (13146-14997)   duration 185.1 x 10 samples   → 182
uw   as in suit     (14997-17051)   duration 205.4 x 10 samples   → 188
*    the phoneme (almost silence) at the beginning of the t in ‘suit’
                    (17051-17306)   duration  25.5 x 10 samples   →  23
tt   as in suit     (17306-17588)   duration  28.2 x 10 samples   →  14
ix   as in in       (17588-18601)   duration 101.3 x 10 samples   →  88
nn   as in in       (18601-19574)   duration  97.3 x 10 samples   → 105
*    the phoneme (almost silence) at the beginning of the g in ‘greasy’
                    (19574-20546)   duration  97.2 x 10 samples   →  72
g    as in greasy   (20546-21506)   duration  96.0 x 10 samples   →  93
r    as in greasy   (21506-22013)   duration  50.7 x 10 samples   →  49
iy   as in greasy   (22013-23026)   duration 101.3 x 10 samples   → 100
s    as in greasy   (23026-25026)   duration 200.0 x 10 samples   → 185
iy   as in greasy   (25026-25943)   duration  91.7 x 10 samples   →  77
w    as in wash     (25943-28199)   duration 225.6 x 10 samples   → 222
ao   as in wash     (28199-29828)   duration 162.9 x 10 samples   → 150
sh   as in wash     (29828-31373)   duration 154.5 x 10 samples   → 141
*    the stop at the end of wash
                    (31373-32130)   duration  75.7 x 10 samples   →  74


We observe that for almost every phoneme, 90 per cent of the observations with 
our window result in a winning neuron in the correct phoneme net. An exception is 
the phoneme dd in ‘dark’ where the dd-phoneme net did not respond. A closer look 
reveals that the similar t-phoneme net responded eight times to the dd-phoneme. 

Table 4.3 Continued test results

w    as in water    (32130-32773)   duration  64.3 x 10 samples   →  67
ao   as in water    (32773-34186)   duration 141.3 x 10 samples   → 130
d    as in water    (34186-34666)   duration  48.0 x 10 samples   →  32
aa   as in water    (34666-36226)   duration 156.0 x 10 samples   → 140
q    glottal stop at the end of water
                    (36226-37097)   duration  87.1 x 10 samples   →  87
ao   as in all      (37097-39729)   duration 263.2 x 10 samples   → 235
l    as in all      (39729-40689)   duration  96.0 x 10 samples   →  94
y    as in year     (40689-42008)   duration 131.9 x 10 samples   → 130
ih   as in year     (42008-43710)   duration 170.2 x 10 samples   → 156
axr  as in year     (43710-44874)   duration 116.4 x 10 samples   → 117
*    the phoneme representing silence at the end of the sentence from 44874 up to 57466

Most spectral vectors of the observations of a phoneme are assigned to the same
neural net and will traverse a path of winning neurons in that net. Occasionally the
path may temporarily jump to another phoneme net, as illustrated in Figure 4.75 for
the twenty-eight observations on the phoneme /t/ in the word ‘suit’. In the figure the
t-phoneme net and the s-phoneme net are schematically drawn. In both nets the 5 x 5
neurons are represented by dots. For the first four observations, neuron 17 in the
t-phoneme net is the winning neuron. For the fifth observation neuron 8 is the
winning neuron. The next observation is assigned to neuron 4, then for six observations
neuron 14 is the winning neuron. For the next seven observations we jump to
the s-phoneme net where neuron 4 will be the winner, then we return again to the
t-phoneme net where neurons 9 and 10 will be the winners, etc.

Figure 4.75 The path of winning neurons in the ‘t-phoneme net’ and
‘s-phoneme net’ by passing the window across the acoustic signal
of the ‘t-phoneme’

When we use a minimum score of fourteen successive winning neurons in the same 
phoneme net, with interrupting jumps to another phoneme net with at most a length 
of ten successive winning neurons, then we can detect the correct sequence of phonemes 
in the given sentence with one exception. 
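
One possible reading of this minimum-score criterion is sketched below. It is an interpretation, not the procedure used in the experiment: consecutive winners in the same phoneme net are grouped into runs, a run of at most `max_gap` winners that is flanked on both sides by the same net is treated as an interruption and absorbed, and a phoneme is reported when its net collects at least `min_score` winners. The function name and the example sequence are illustrative.

```python
def detect_phonemes(winning_nets, min_score=14, max_gap=10):
    """Turn a sequence of winning-net labels into a sequence of detected phonemes."""
    # Run-length encode the sequence: [[label, number of winners], ...].
    runs = []
    for label in winning_nets:
        if runs and runs[-1][0] == label:
            runs[-1][1] += 1
        else:
            runs.append([label, 1])

    # Absorb short interruptions that are flanked by the same phoneme net.
    i = 1
    while i < len(runs) - 1:
        same_flanks = runs[i - 1][0] == runs[i + 1][0]
        if same_flanks and runs[i][1] <= max_gap:
            runs[i - 1][1] += runs[i + 1][1]   # count only winners in the flanking net
            del runs[i:i + 2]
            i = max(i - 1, 1)
        else:
            i += 1

    return [label for label, hits in runs if hits >= min_score]

# Illustrative sequence: the t-net is interrupted by seven winners in the s-net.
sequence = ["s"] * 20 + ["t"] * 11 + ["s"] * 7 + ["t"] * 10 + ["ix"] * 30
detected = detect_phonemes(sequence)
```

In the illustrative sequence neither stretch of t-net winners reaches the minimum score on its own, but together they do once the short excursion to the s-net is ignored.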


Note: We used some kind of preprocessing of the observation of the 128 samples
in our window. The window sample vectors x_w were first multiplied with a Hanning
window x_h, x = x_w · x_h, with:

$$x_h(n) = \tfrac{1}{2}\left[1 - \cos\frac{2\pi n}{N-1}\right]$$

Next a discrete Fourier transform is applied to x, resulting in a power spectrum
vector p. Finally we reduced the vector to sixteen components with the logarithmic
frequency Mel scale.
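
The preprocessing chain in the note can be sketched as follows. The Hanning window and the power spectrum follow the note directly; the reduction to sixteen components is only one plausible reading, since the note does not specify the filters. Here the power is simply summed in sixteen bands whose edges are equally spaced on the Mel scale, using the common conversion mel = 2595 log10(1 + f/700). The function names and the test tone are assumptions.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def spectral_vector(x_w, fs=16000, n_bands=16):
    """Hanning window -> power spectrum -> n_bands Mel-spaced band sums."""
    x_w = np.asarray(x_w, dtype=float)
    n = len(x_w)
    hann = 0.5 * (1.0 - np.cos(2.0 * np.pi * np.arange(n) / (n - 1)))
    power = np.abs(np.fft.rfft(x_w * hann)) ** 2          # power spectrum vector p
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    # Band edges equally spaced in Mel between 0 Hz and fs/2 (assumed band layout).
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2.0), n_bands + 1))
    band = np.clip(np.searchsorted(edges, freqs, side="right") - 1, 0, n_bands - 1)
    return np.bincount(band, weights=power, minlength=n_bands)

# A 1 kHz test tone observed through a 128-sample window at 16 kHz.
t = np.arange(128) / 16000.0
tone = np.sin(2.0 * np.pi * 1000.0 * t)
features = spectral_vector(tone)
```

Because every frequency bin is assigned to exactly one band, the sixteen components add up to the total power of the windowed signal; overlapping triangular filters, which are also common, would not have this property.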


4.18 Selecting and scaling of training vectors 


Suppose our data set D_V ⊂ V of different d-dimensional data vectors v_i ∈ R^d is generated
in a sequence of steps. At each step a vector v_i is generated according to some
probability density function f(v) in a d-dimensional input space V. The obtained
sequence (length M) of vectors will be called the training sequence T_M, and the
consecutive elements will be indexed by a superscript: T_M = v^1, v^2, ..., v^M. The same
vector v_i may thus occur more than once in the training sequence T_M, say m_i times.
Therefore we have p(v^t ∈ T_M) = 1/M and p(v_i ∈ T_M) = m_i/M (= f(v_i) dv). We found in
Section 4.4 that if we select at step t of the learning process a vector v(t) = v_i with
probability p(v_i) from the data set D_V and change the weight vectors according
to the algorithmic adaptation rule, then we are minimizing the quantization error in
the final phase of learning:

$$E(W) = \frac{1}{2} \sum_{w_k} \sum_{\substack{v_i \in R(w_k) \\ v_i \in D_V}} p(v_i)\, \| v_i - w_k \|^2$$

During training, each vector v_i must be presented several times (preferably > 10 times)
to the neural algorithm. If the training sequence T_M contains enough input vectors
to present each vector v_i often enough (i.e. m_i > 10 for all i), the choice of an input
vector from T_M at learning step t is not so difficult: choose v(t) = ‘the tth element of
T_M’. We must, however, be aware that after a certain time T_f there will be almost no
adaptation of weights in the neural net. If we use the function ε(t) = (1 + t)^-1 for the
learning rate ε(t), then after 10000 steps the adaptation of weight vectors will be 1 per
cent; certainly there is no need to proceed with training to more than 100000 steps. If we
need a larger value of T_f we have to use another function for ε(t).

A problem arises if M ≪ T_f. In that case we randomly draw an element of T_M,
v(t) = v^k, with k a uniformly distributed integer, 1 ≤ k ≤ M. The ordering of weights
in the neural net depends on the particular sequence of applied input vectors. To
obtain a result independent of the applied sequence we must select the region
adaptation function h(r, s, t) such that the whole training set is presented several times
before the quantization phase [h(r, s, t) = 1 for r = s and h(r, s, t) = 0 otherwise] starts.

Another way of drawing input vectors is v(t) = v^(t mod M), and as soon as t = M the
training set is permuted randomly. This method has the advantage that every input
vector occurs p times if T_f = p x M.
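
The second drawing scheme can be sketched as a small generator. It assumes that the training set is re-permuted each time a full pass of M vectors has been completed, and that the first pass already uses a random order. Both points are readings of the description above rather than details given in the text, and the function name is illustrative.

```python
import numpy as np

def permuted_indices(m, total_steps, seed=0):
    """Yield the index of the training vector to present at each learning step."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(m)
    for t in range(total_steps):
        if t > 0 and t % m == 0:
            order = rng.permutation(m)   # new random order after every full pass
        yield int(order[t % m])

# With T = p x M steps every vector is presented exactly p times (here p = 3, M = 5).
counts = np.bincount(list(permuted_indices(5, 15)), minlength=5)
```

Compared with drawing k uniformly at every step, this scheme removes the sampling fluctuations in how often each vector is presented while still varying the order of presentation.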

If the dimension of the input vectors is larger than the dimension of the neural
net, the ordering and quantization is dominated by the components of the input
vectors with the largest variance in the component values. In determining the ‘winning
neuron’ during training, small differences in component values of v and w will be
overruled by components with large differences. This implies that we have to scale
the input vectors in such a way that the range of all components is almost the same.
This can be done by the following method:


1. Determine for each component i of the input vectors x_j of the data set:
x_i^max = max_j {x_ji} and x_i^min = min_j {x_ji}.

2. Determine for each i: d_i = x_i^max - x_i^min.

3. Determine d_max = max_i {d_i}.

4. Multiply the ith component x_ji of all vectors x_j by d_max/d_i.


This procedure will bring all components into the same range d_max. If after training we
need the values of the weight vectors, we have to multiply the ith component of a
weight vector by d_i/d_max.
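
The scaling procedure and its inverse for the weight vectors can be sketched as follows. The function names are hypothetical, the data are assumed to be stored with one input vector per row, and every component is assumed to vary over the data set (a constant component would give d_i = 0).

```python
import numpy as np

def scale_components(x):
    """Scale every component of the data set to the common range d_max."""
    x = np.asarray(x, dtype=float)
    d = x.max(axis=0) - x.min(axis=0)   # steps 1 and 2: range d_i of each component
    d_max = d.max()                     # step 3
    return x * (d_max / d), d, d_max    # step 4

def unscale_weights(w, d, d_max):
    """Map trained weight vectors back to the original component ranges."""
    return np.asarray(w, dtype=float) * (d / d_max)

# Toy data set: the second component has a ten times larger range than the first.
data = np.array([[0.0, 10.0], [1.0, 30.0], [2.0, 20.0]])
scaled, d, d_max = scale_components(data)
restored = unscale_weights(scaled, d, d_max)
```

After scaling, both components of the toy data span a range of 20, so neither dominates the Euclidean distance used to determine the winning neuron.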


4.19 Some practical measures of performance of the 
self-organizing neural net algorithm 


In order to evaluate the performance of vector quantization by the self-organizing 
algorithm, we can calculate, after learning, the energy of quantization noise defined 
as follows (see Section 4.4): 


$$E_{DW} = \frac{1}{M} \sum_{w_k} \sum_{\substack{v_i \in R(w_k) \\ v_i \in D_V}} \| v_i - w_k \|^2$$

with M the number of elements in D_V.
The quantization noise is defined by:

$$Q_{DW} = \frac{1}{M} \sum_{w_k} \sum_{\substack{v_i \in R(w_k) \\ v_i \in D_V}} \| v_i - w_k \|$$

If we use the training sequence T_M, as defined in Section 4.18, to evaluate E_DW and
Q_DW, we obtain respectively:

$$E_{DW} = \frac{1}{M} \sum_{w_k} \sum_{\substack{v_i \in R(w_k) \\ v_i \in D_V}} m_i\, \| v_i - w_k \|^2$$

with M the number of elements in T_M, and

$$Q_{DW} = \sum_{w_k} \sum_{\substack{v_i \in R(w_k) \\ v_i \in D_V}} p(v_i)\, \| v_i - w_k \|$$
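
A minimal sketch of how the two quantization measures might be computed for a data set is given below. It assumes that the receptive field R(w_k) consists of the vectors for which w_k is the nearest weight vector in the Euclidean sense, and that both measures are averages over the M vectors of the data set. The function name and the toy numbers are illustrative.

```python
import numpy as np

def quantization_measures(data, weights):
    """Return (energy of quantization noise, quantization noise) for a data set."""
    data = np.asarray(data, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Distance from every data vector to every weight vector.
    dist = np.linalg.norm(data[:, None, :] - weights[None, :, :], axis=2)
    nearest = dist.min(axis=1)          # distance to the winning weight vector
    return float(np.mean(nearest ** 2)), float(np.mean(nearest))

# Toy example: three data vectors, two weight vectors.
data = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 0.0]])
weights = np.array([[0.0, 0.0], [3.0, 0.0]])
energy, noise = quantization_measures(data, weights)
```

In the toy example the distances to the winning weight vectors are 0, 1 and 1, so both measures equal 2/3.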

Furthermore the vector quantization of the self-organizing algorithm will ‘try’ to
preserve topology (see Section 4.11). Thus we also want to have a measure to evaluate
the topology preservation.

For topology preservation we want the mutual relative position of two vectors v_i
and v_j in the input space V to be to a certain extent preserved by the mutual relative
position of the corresponding winning neurons Ψ(v_i) and Ψ(v_j) in the neural lattice.

A mapping from a metric space V with distance measure d_V to a metric space L
with distance measure d_L is metric preserving if the triangular inequality property is
preserved; that is, if

$$d_V(v_i, v_j) \le d_V(v_i, v_k) + d_V(v_k, v_j)$$

then

$$d_L(\Psi(v_i), \Psi(v_j)) \le d_L(\Psi(v_i), \Psi(v_k)) + d_L(\Psi(v_k), \Psi(v_j))$$

For metric preservation it is required that if the distance between two vectors from
V is small, they will be represented by the same neuron or by neighbouring neurons
in the neural net. This requirement can be stated as follows: if two receptive fields
R(w_i) and R(w_j) in V are adjacent [the common border ∂R_ij = R(w_i) ∩ R(w_j) is not
empty], then neuron u_i with weight vector w_i is a neighbour of neuron u_j with weight
vector w_j. If this property holds for all receptive fields, we say that the feature mapping
is topology preserving.

If ω is the projection from the weight space W to the neural lattice L, then we can
consider for a practical measure of topology preservation the distance in the neural
lattice between ω(w_i) and ω(w_j) if the corresponding receptive fields R(w_i) and R(w_j)
are adjacent (i.e. they have a common border). For optimal topology preservation
this distance would always be 1. If there is a distortion of topology preservation,
then the distance between ω(w_i) and ω(w_j) will be larger than 1. So we can use the
following estimation of topological energy as a practical measure for topology
preservation:

$$T_{BW} = \frac{1}{b} \sum_{i} \sum_{\substack{j = i+1 \\ R(w_i)\ \text{adjacent}\ R(w_j)}} d_L^2\{\omega(w_i), \omega(w_j)\}$$

with b the number of borders of adjacent receptive fields in the input space V.
The topological noise is defined as:

$$J_{BW} = \frac{1}{b} \sum_{i} \sum_{\substack{j = i+1 \\ R(w_i)\ \text{adjacent}\ R(w_j)}} d_L\{\omega(w_i), \omega(w_j)\}$$


Figure 4.76 Two adjacent receptive fields where (w_i + w_j)/2 is not in
R(w_i) ∪ R(w_j)

A simple (but not perfect) way to determine whether two receptive fields R(w_i) and
R(w_j) are adjacent or not is to check whether:

$$\frac{w_i + w_j}{2} \in R(w_i) \cup R(w_j)$$

It may occur, however, that two receptive fields are adjacent along a ‘small’ common
border and the expression above is not true, as illustrated in Figure 4.76.
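
The midpoint test can be combined with the lattice distance to give a rough estimate of the topological noise. The sketch below makes several assumptions: the receptive field of a weight vector is the set of points nearest to it, the lattice distance is the city-block distance between lattice coordinates, and the result is the mean lattice distance over all pairs that pass the midpoint test. The function name and the toy configurations are illustrative.

```python
import numpy as np
from itertools import combinations

def topological_noise(weights, positions):
    """Mean lattice distance over pairs of receptive fields found adjacent.

    weights:   one weight vector per row (points in the input space V)
    positions: lattice coordinate(s) of each neuron, in the same order
    """
    weights = np.asarray(weights, dtype=float)
    positions = np.asarray(positions, dtype=float)
    total, borders = 0.0, 0
    for i, j in combinations(range(len(weights)), 2):
        midpoint = 0.5 * (weights[i] + weights[j])
        nearest = int(np.argmin(np.linalg.norm(weights - midpoint, axis=1)))
        if nearest in (i, j):                       # midpoint lies in R(w_i) or R(w_j)
            total += np.abs(positions[i] - positions[j]).sum()
            borders += 1
    return total / borders, borders

# Four weight vectors on a line in a two-dimensional input space.
line = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
ordered_noise, ordered_borders = topological_noise(line, [0, 1, 2, 3])
# Same points, but neurons 1 and 2 have exchanged their weight vectors.
swapped = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 0.0], [3.0, 0.0]])
swapped_noise, swapped_borders = topological_noise(swapped, [0, 1, 2, 3])
```

For the ordered line every adjacent pair of receptive fields belongs to neighbouring neurons, so the estimate is 1; exchanging two weight vectors raises it to 5/3.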


Example 4.14 


As a training sequence 1000 two-dimensional input vectors were randomly chosen 
from an input space V with a uniform distribution of vectors in the domain [0, 1]^2.
The neural lattice was one-dimensional with 100 neurons, and so perfect topology 
preservation is not possible. The number of training steps was: T;=2'’. In 
Figure 4.77 we have given the quantization noise Qpw as a function of the learning 
steps. The initial quantization noise is low because an initial random distribution of 
weight vectors is already a good representation of randomly distributed input vectors. 
After two steps the quantization noise is a monotone decreasing function of t. 

In Figure 4.78 we have given the value of the topological noise Jey as a function 
of the training steps. We observe an initial decrease of Jgw from initial disorder to 
perfect topology preservation at t=2*. The topology noise increases again starting 
at t= 27. The reason for this increase can be explained as follows. In the first relatively 
short period, the ordering phase, ordering and topology preservation take place. In 
the quantization phase [h(r, s, t)=1 for r=s and h(r, s, t)=0 for r≠s] the ordering is 
overruled by the quantization (see also Sections 4.4 and 4.11). The quantization noise 
is decreasing whereas the topology noise is increasing. 
Figure 4.77 Quantization noise during training of a one-dimensional net 
with 100 neurons with 1000 two-dimensional uniformly 
distributed input vectors 
Figure 4.78 Topological noise during training of a one-dimensional net with 
100 neurons with 1000 two-dimensional uniformly distributed 
input vectors 


Other simple measures of performance are as follows: 


● Number of geometric neighbours l (lines): the number of pairs of neurons in the 
lattice with distance $d_L = 1$. 

● Number of effective close neighbours c (connections): the number of pairs of adjacent 
receptive fields which belong to neurons with distance $d_L = 1$. 

● Number of conflicting close neighbours d (defects): the number of pairs of receptive 
fields that are not adjacent which belong to neurons with distance $d_L = 1$. 

● Number of effective neighbours b (borders): the number of pairs of adjacent receptive 
fields. 

● Number of effective distant neighbours j (jumps): the number of pairs of adjacent 
receptive fields which belong to neurons with a distance $d_L > 1$. 


Note: l=c+d and b=j+c. 


Figure 4.79 Two-dimensional input/weight space for a one-dimensional 
neural net of ten neurons. Areas are the receptive fields 
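The five counts listed above can be computed with simple set operations once the pairs of adjacent receptive fields are known. The following is a sketch under assumptions: the adjacency pairs are supplied by the caller, and the lattice distance is taken to be the Manhattan distance between integer lattice coordinates (any other distance function can be passed in). The names `neighbour_counts` and `manhattan` are illustrative.

```python
from itertools import combinations


def manhattan(p, q):
    """Assumed lattice distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def neighbour_counts(lattice_coords, adjacent_pairs, lattice_distance=manhattan):
    """Return the counts l, c, d, b and j for a lattice and a set of borders."""
    adjacent = {frozenset(pair) for pair in adjacent_pairs}
    close = {
        frozenset((r, s))
        for r, s in combinations(range(len(lattice_coords)), 2)
        if lattice_distance(lattice_coords[r], lattice_coords[s]) == 1
    }
    return {
        "l": len(close),             # lines: lattice neighbours
        "c": len(close & adjacent),  # connections: lattice neighbours with a border
        "d": len(close - adjacent),  # defects: lattice neighbours without a border
        "b": len(adjacent),          # borders: all adjacent receptive fields
        "j": len(adjacent - close),  # jumps: borders between distant neurons
    }


# Toy one-dimensional lattice with four neurons (hypothetical data).
coords = [(0,), (1,), (2,), (3,)]
borders = [(0, 1), (1, 2), (0, 3)]
counts = neighbour_counts(coords, borders)
print(counts)  # {'l': 3, 'c': 2, 'd': 1, 'b': 3, 'j': 1}
```

Because the sets are built by intersection and difference, the identities l=c+d and b=j+c hold by construction, which makes them a useful check on hand-counted values.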


Example 4.15 


In Figure 4.79 we have given a two-dimensional input space V. The dots represent 
the weights of a one-dimensional neural lattice with 10 neurons. We find for 
Figure 4.79: l=9, c=8, d=1, b=18 and j=10. 


In Section 4.11 we found that if the dimension of the input space V is the same as 
the dimension of the neural lattice L, then topology will be preserved. This implies 
that if the distance between two input vectors approaches zero, then the distance 
between the corresponding winning neurons is at most equal to 1. If, however, the 
dimension m of the input space is larger than the dimension n of the neural lattice, 
then in general there will be no topology preservation, and if the distance between 
two input vectors approaches zero, then the distance between the corresponding 
winning neurons might become large (>> 1). A continuous path in V results in a 
corresponding track in the lattice L with in general large jumps between the successive 
winning neurons. 

We found, however, that we can obtain topology preservation in a restricted sense 
if the training set D consists of m-dimensional vectors, whereas the neural lattice has 
a dimension n much smaller than m. This will be the case if the components of the 
m-dimensional vectors of the training set D are interrelated such that they can be 
represented by points in an n-dimensional space while preserving the topology for 
the elements of D. 

With the measure $J_{EW}$ of topological noise we demonstrate this mechanism by 
a simple experiment. 


Example 4.16 


In an experiment we used nine-dimensional training vectors. A number of the 
components of the training vector were made independent by selecting for each of them a 
random number in the interval [0, 1]. The dependent components were just copies 
of the independent components. We used six different neural network dimensions: 
one-dimensional (220 neurons), two-dimensional (15 x 15 neurons), three-dimensional 
(6 x 6 x 6 neurons), four-dimensional (4 x 4 x 4 x 4 neurons), five-dimensional 
(3 x 3 x 3 x 3 x 3 neurons), eight-dimensional (2 x 2 x 2 x 2 x 2 x 2 x 2 x 2 neurons). 

In Figure 4.80 we have given the results for the topological noise $J_{EW}$ for the 
different neural nets. We observe that a minimum of the topological noise is obtained 
if the number of independent components is equal to the dimension of the neural 
network. Note that for optimal topology preservation, the value of topological noise 
is $J_{EW} = 1$. 
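The training data of Example 4.16 can be imitated with a short generator. This is only a sketch: the text does not say which dependent component copies which independent one, so the cyclic assignment below is an assumption, as is the function name `make_training_vector`. Note also that `random.random()` draws from [0, 1) rather than the closed interval [0, 1].

```python
import random


def make_training_vector(n_independent, dim=9, rng=random):
    """Build a dim-dimensional vector with n_independent free components.

    The first n_independent components are uniform random numbers; the
    remaining components are copies of them, assigned cyclically (assumption).
    """
    if not 1 <= n_independent <= dim:
        raise ValueError("n_independent must lie between 1 and dim")
    base = [rng.random() for _ in range(n_independent)]
    return [base[k % n_independent] for k in range(dim)]


# A nine-dimensional vector whose points lie on a three-dimensional subspace.
vector = make_training_vector(3, rng=random.Random(0))
print(len(vector), len(set(vector)))
```

Vectors generated this way occupy only an n-dimensional subspace of the nine-dimensional input space, which is the situation in which the text reports minimal topological noise for an n-dimensional lattice.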


4.20 Application of the self-organizing algorithm to signature 
identification 


We will end this book with a nice and powerful application of the self-organizing 
neural net algorithm. 

In many situations the signature of a person will authorize some official certificate. 
However, after some training, the pattern of a signature can be duplicated. To avoid 
the falsification of signatures many corporations would like to have means to 
check the authenticity of a signature. 

From the static pattern of a signature alone one cannot deduce its authenticity. 
However, the dynamic pattern of writing down a signature is far more difficult to 
copy. The speed of making the curves in a signature is quite characteristic for a 
person and hard to imitate. 

We can use the self-organizing algorithm to solve this problem. We will only give 
a global outline of the implementation because the concept will be patented. 

First we have to store the dynamic and static characteristics of a signature. For 
that purpose a person is asked to write down his or her signature on an electronic 
notepad several times. For each signature the position of the pencil is sampled at 
small equidistant moments of time (±0.1 sec). Figure 4.81 presents the signature of 
the author; the twenty dots give the positions of the sample points. 

Figure 4.81 The author's signature with sample observation points 

Figure 4.82 The observation of the signature with a window at the fifth 
sample point 

Now we use the observation window introduced in Section 4.2, as was used for 
character recognition in Section 4.10. At each sample point we observe through the 
window the pattern of the signature, but only that part of the signature that has 
actually been written down at the moment of sampling. In Figure 4.82 we observe the 
first part of the signature at the fifth sample point. In Figure 4.83 we observe a larger 
part of the signature at the tenth sample point. 

Figure 4.83 The observation of the signature with a window at the tenth 
sample point 

Each observation will give us an observation vector of seventeen components. Each 
component reflects the mean grey value in a field of the window (see Sections 4.2 
and 4.10). For the signature of Figure 4.81 we obtain twenty different observation 
vectors. If we have ten examples of the signature, we obtain in this way 200 
observations. We train a self-organizing two-dimensional neural net with 5 x 5 neurons 
with the 200 observation vectors. Because of the vector quantization performed by 
the self-organizing algorithm, we obtain twenty-five weight vectors representing 
clusters of similar observations. The examples of the signature are now used again 
to determine the dynamics of the signature. For each signature we have a sequence 
of twenty observation vectors. We present the sequence of observation vectors of a 
signature again to the neural net and determine the corresponding sequence of ‘winning 
neurons’. The sequence of winning neurons is stored in a file. The same will be done 
for the remaining set of nine examples of signatures. The obtained (similar) sequences 
of winning neurons will reflect the dynamics of a signature. If, for example, a part 
of the signature is written very slowly, the same neuron will be the winning neuron 
for several consecutive observations, because almost the same observation is presented 
several times to the neural net. The transition from a winning neuron to the next 
neuron represents the movement of the pencil from one part of the signature to the 
next part. 
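The mapping from a sequence of observation vectors to a sequence of winning neurons can be sketched as below. The sketch assumes that the winning neuron is the one whose weight vector is closest to the observation in the Euclidean sense and that the trained weight vectors are available as a plain list; the function name `winning_sequence` is illustrative, and the toy vectors have two components instead of seventeen to keep the example short.

```python
from math import dist


def winning_sequence(weights, observations):
    """Return, for each observation vector, the index of the closest weight vector."""
    return [
        min(range(len(weights)), key=lambda k: dist(weights[k], v))
        for v in observations
    ]


# Hypothetical trained weights and one short observation sequence.
weights = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
observations = [(0.1, 0.0), (0.2, 0.1), (0.9, 0.1), (1.0, 0.9)]
print(winning_sequence(weights, observations))  # [0, 0, 1, 2]
```

The repeated index at the start of the output corresponds to a slowly written part of the signature: two consecutive observations are nearly identical, so the same neuron wins twice.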


We can do some statistical analysis on the obtained sequences of winning neurons 
in order to determine the probability of a sequence of winning neurons for that 
particular signature. 

After this learning phase we can use the neural net to identify with some probability 
the authenticity of a signature written down on an electronic notepad. 
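The text does not specify which statistical analysis is used. Purely as an illustration, one possible choice is a first-order transition model: estimate from the stored sequences how often one winning neuron follows another, and score a new sequence by the product of its transition probabilities. Everything in the sketch below, including the function names `transition_probabilities` and `sequence_probability`, is an assumption and not the method of the text.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequences):
    """Estimate P(next winner | current winner) from example sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    return {
        current: {nxt: n / sum(c.values()) for nxt, n in c.items()}
        for current, c in counts.items()
    }


def sequence_probability(seq, probs):
    """Product of transition probabilities; unseen transitions count as zero."""
    p = 1.0
    for current, following in zip(seq, seq[1:]):
        p *= probs.get(current, {}).get(following, 0.0)
    return p


# Two hypothetical winner sequences from genuine signatures.
examples = [[0, 0, 1, 2], [0, 1, 1, 2]]
probs = transition_probabilities(examples)
print(sequence_probability([0, 0, 1, 2], probs))  # about 0.148
print(sequence_probability([2, 1, 0], probs))     # 0.0
```

A practical system would probably need smoothing for unseen transitions and a decision threshold; both are left out of this sketch.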


4.21 Exercises 


1. Assume we have a two-dimensional neural network with nine neurons $u_j$. The 
coordinate vectors $i_j$ of the neurons in the lattice are: $i_1 = [0, 0]$, $i_2 = [1, 0]$, 
$i_3 = [2, 0]$, $i_4 = [0, 1]$, $i_5 = [1, 1]$, $i_6 = [2, 1]$, $i_7 = [0, 2]$, $i_8 = [1, 2]$, $i_9 = [2, 2]$. 
At a certain time step t the weight vectors $w_j$ are: $w_1 = [4, 5]$, $w_2 = [5, 3]$, 
$w_3 = [4, 6]$, $w_4 = [1, 1]$, $w_5 = [2, 2]$, $w_6 = [0, 0]$, $w_7 = [6, 4]$, $w_8 = [3, 5]$, $w_9 = [5, 4]$. 
At time t we have an input vector $v = [3, 3]$. What will be the value of the 
weights at time t+1 if we apply the algorithmic adaptation rule, if $\varepsilon(t) = 1$ and 
$h(r, s, t) = 1$ if $\|i_r - i_s\| = 0$, $h(r, s, t) = 0.5$ if $\|i_r - i_s\| = 1$ and $h(r, s, t) = 0$ otherwise? 
2. What will be the value of the root mean squared error E(W): 

$$E(W) = \frac{1}{2} \sum_{w_r} \sum_{w_s} \sum_{v \in R(w_s)} p(v)\, h(r, s, t)\, \|v - w_r\|^2$$

for the data set D = {[3, 3]} with p([3, 3])=1 before and after adaptation for the 
self-organizing neural net of exercise 1 if $\varepsilon(t) = \varepsilon(t+1)$ and $h(r, s, t) = h(r, s, t+1)$? 

3. How could we arrange the training process such that the solution route of the 
travelling salesman problem (see Section 4.6) will start in a given town? 

4. We have a two-dimensional input data set $D_V$ with data uniformly distributed in 
the domain [0.5, 3.5] x [0.5, 3.5]. We use a two-dimensional self-organizing neural 
net with nine neurons. The coordinate vectors $i_j$ of the neurons in the lattice 
are: $i_1 = [0, 0]$, $i_2 = [1, 0]$, $i_3 = [2, 0]$, $i_4 = [0, 1]$, $i_5 = [1, 1]$, $i_6 = [2, 1]$, $i_7 = [0, 2]$, 
$i_8 = [1, 2]$, $i_9 = [2, 2]$. What will be the final value of the weights $w_j$ of all neurons if the algorithm 
performs a perfect vector quantization? 

5. The same problem as in exercise 4 but now with a one-dimensional neural net. 
The coordinate vectors $i_j$ of the neurons in the lattice are: $i_1 = [1]$, $i_2 = [2]$, $i_3 = [3]$, 
$i_4 = [4]$, $i_5 = [5]$, $i_6 = [6]$, $i_7 = [7]$, $i_8 = [8]$, $i_9 = [9]$. 
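A hand-worked answer to exercise 1 can be checked with a short script. The sketch below assumes that the adaptation rule has the usual form w_r(t+1) = w_r(t) + eps * h(r, s, t) * (v - w_r(t)), where s is the winning neuron (the one whose weight vector is closest to v) and eps is the learning rate, and that the lattice distance in h is the Euclidean norm of the coordinate difference. The function name `adapt` is illustrative.

```python
from math import dist


def adapt(weights, coords, v, eps=1.0):
    """One adaptation step with the neighbourhood function of exercise 1."""
    s = min(range(len(weights)), key=lambda k: dist(weights[k], v))  # winning neuron

    def h(r):
        d = dist(coords[r], coords[s])
        return 1.0 if d == 0 else 0.5 if d == 1 else 0.0

    return [
        tuple(w_k + eps * h(r) * (v_k - w_k) for w_k, v_k in zip(w, v))
        for r, w in enumerate(weights)
    ]


coords = [(0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (2, 1), (0, 2), (1, 2), (2, 2)]
weights = [(4, 5), (5, 3), (4, 6), (1, 1), (2, 2), (0, 0), (6, 4), (3, 5), (5, 4)]
new_weights = adapt(weights, coords, v=(3, 3))
for index, w in enumerate(new_weights, start=1):
    print(f"w_{index} = {w}")
```

Only the winning neuron and its four lattice neighbours at distance 1 move; all other weight vectors keep their values.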